Abstract



Polar codes are recursive general concatenated codes. This property motivates a recursive formalization of the known decoding algorithms: Successive Cancellation, Successive Cancellation with Lists and Belief Propagation. This description allows an easy development of the first two algorithms for arbitrary polarizing kernels. Hardware architectures for these decoding algorithms are also described in a recursive way, both for Arikan's standard polar codes and for arbitrary polarizing kernels.

1 Introduction 

Polar codes were introduced by Arikan [1] and provided a scheme for achieving the symmetric capacity of binary memoryless channels (B-MC) with polynomial encoding and decoding complexities. Arikan used the so-called (u + v, v) construction, which is based on the following linear kernel

$$G_2 = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}.$$



In this scheme, a $2^n \times 2^n$ matrix, $G_2^{\otimes n}$, is generated by taking the $n$-th Kronecker power of $G_2$. An input vector $\mathbf{u}$ of length $N = 2^n$ is transformed to a length-$N$ vector $\mathbf{x}$ by multiplying a certain permutation of $\mathbf{u}$ by $G_2^{\otimes n}$. The vector $\mathbf{x}$ is transmitted through $N$ independent copies of the memoryless channel $W$. This results in $N$ new (dependent) channels between the individual components of $\mathbf{u}$ and the outputs of the channels. Arikan showed that these channels exhibit the phenomenon of polarization under Successive Cancellation (SC) decoding. This means that as $n$ grows, a fraction $I(W)$ (the symmetric capacity of the channel) of the channels become clean (i.e., their capacity approaches 1), and the rest become completely noisy (i.e., their capacity approaches 0). Arikan also showed that the SC decoding algorithm has time and space complexity $O(N \log N)$ (the same complexity holds for the encoding algorithm). Furthermore, it was shown [5] that, asymptotically in the block length $N$, the block error probability of this scheme decays to zero as $O(2^{-N^{\beta}})$ for any $\beta < 1/2$.



Generalizations of Arikan's code structures soon followed. Korada et al. considered binary linear kernels [3]. They showed that a binary linear kernel is polarizing if and only if no column permutation of its generating matrix is upper-triangular, and they analyzed the rate of polarization by introducing the notion of the kernel exponent. Mori and Tanaka considered the general case of a mapping $g(\cdot)$, which is not necessarily linear or binary, as a basis for channel polarization constructions [4]. They gave sufficient conditions for polarization and generalized the exponent to these cases. They further showed examples of linear, non-binary kernels based on Reed-Solomon and Algebraic Geometry codes, with exponents that are far better than those of the known binary kernels [5]. The authors of this correspondence gave examples of binary but non-linear kernels having the optimal exponent for their kernel dimensions [6]. All of these structures had homogeneous kernels, meaning that their inputs and their outputs were over the same alphabet. The authors of this correspondence also considered the case in which some of the inputs of a kernel have a different alphabet than the rest of the inputs [7]. This results in the so-called mixed-kernel structure, which has demonstrated good performance for finite-length codes in many cases. A further generalization of the polar code structure was suggested by Trifonov [8], in which the outer polar codes were replaced by suitable codes along with their appropriate decoding algorithms. We note here that the representation of polar codes as instances of general concatenated codes (GCC) is fundamental to this correspondence, and we elaborate on it in the sequel.

Generalizations and alternatives to SC as the decoding algorithm were also studied. Tal and Vardy introduced the Successive Cancellation List (SCL) decoder [9], [10]. In this algorithm, the decoder considers up to $M$ concurrent decoding paths at each one of its stages, where $M$ is the size of the list. At the final stage of the algorithm, the most likely result is selected from the list. The asymptotic time and space complexities of this decoder are those of the standard SC algorithm, multiplied by $M$. Furthermore, introducing a cyclic redundancy check (CRC) code as an outer code results in a scheme with excellent error-correcting performance, which is sometimes comparable to state-of-the-art schemes (see e.g. [10, Section V]). Bonik et al. suggested using a separate CRC and a different list size for each outer code in the GCC structure of the polar code. This approach seems to give better results than the standard list approach with the same average list size. Finally, Li et al. [11] suggested an iterative SCL-with-CRC algorithm, in which the decoder doubles the list size and restarts the algorithm if, at the end of the SCL algorithm, no result satisfies the CRC. Here again, excellent performance is achieved with a limited average list size, and the scheme outperforms Tal and Vardy's original approach. Note, however, that here the average time and space complexity (rather than the worst-case complexity) is the basis for comparison between the approaches.

Belief Propagation (BP) is an alternative to the SC decoding algorithm. It is an iterative message-passing decoding algorithm that operates on the normal factor graph representation of the code. It is known to outperform SC over the Binary Erasure Channel (BEC) [12], and it seems to perform well on other channels as well [12], [13].

Leroux et al. considered efficient hardware implementations of the SC decoder for the (u + v, v) polar code [14], [15]. They gave an explicit design of a "line decoder" with $N/2$ processing elements and $O(N)$ memory elements. Their work also contains an efficient approximate min-sum decoder and a discussion of a fixed-point implementation. Their design is verified by ASIC synthesis. Pamuk considered a hardware design of a BP decoder tailored for an FPGA implementation [16].

The goal of this paper is to emphasize the formalization of polar codes as recursive GCCs and the implications of this structure for the decoding algorithms. The main contributions of this correspondence are as follows: 1) formalizing Tal and Vardy's SCL as a recursive algorithm, and thereby generalizing it to arbitrary kernels; 2) formalizing Leroux et al.'s SC line decoder and generalizing it to arbitrary kernels; 3) defining a BP decoder with a GCC schedule, and suggesting a BP line architecture for it.

The paper is organized as follows. In Section 2 we describe polar code kernels as the generating building blocks of polar codes. We then elaborate on the fact that polar codes are examples of recursive GCC structures. This fundamental notion motivates the recursive formalization of the decoding algorithms in Section 3. Specifically, we do this for the standard SC, for SCL (both for Arikan's kernel and for arbitrary ones) and for BP (for Arikan's kernel, using the GCC schedule). These formalizations lay the ground for the hardware architectures of the decoding algorithms in Section 4. Specifically, we restate Leroux et al.'s SC pipeline and SC line decoders, and introduce a line decoder for the GCC schedule of the BP algorithm. Finally, in Section 5 we consider generalizations of these architectures to arbitrary kernels.

2 Preliminaries 

Throughout, we use the following notation. Vectors are denoted by bold letters, for example $\mathbf{u}$. For $i \ge j$, let $\mathbf{u}_j^i = (u_j, \ldots, u_i)$ be the sub-vector of a vector $\mathbf{u}$ of length $i - j + 1$ (if $i < j$ we say that $\mathbf{u}_j^i = ()$, the empty vector, and its length is 0).

In this paper we consider kernels that are based on bijective transformations over a field $F$. A channel polarization kernel of dimension $\ell$, denoted by $g(\cdot)$, is a mapping

$$g : F^{\ell} \to F^{\ell}.$$

This means that $g(\mathbf{u}) = \mathbf{x}$, $\mathbf{u}, \mathbf{x} \in F^{\ell}$. Denote the output components of the transformation by

$$g_i(\mathbf{u}) = x_i, \qquad 0 \le i \le \ell - 1.$$

We note that this type of kernel is referred to as a homogeneous kernel, because its $\ell$ input coordinates and $\ell$ output coordinates are from the same alphabet $F$.

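The homogeneous kernel $g(\cdot)$ may generate a polar code of length $\ell^m$ by inducing a larger mapping from it, in the following way [H [7].

Definition 1 Given a transformation $g(\cdot)$ of dimension $\ell$, we construct a mapping $g^{(m)}(\cdot)$ of dimension $\ell^m$ (i.e., $g^{(m)}(\cdot) : F^{\ell^m} \to F^{\ell^m}$) in the following recursive fashion:

$$g^{(1)}(\mathbf{u}_0^{\ell-1}) = g(\mathbf{u}_0^{\ell-1}),$$

$$g^{(m)}(\mathbf{u}_0^{\ell^m-1}) = \Big[ g^{(1)}(\gamma_{0,0}, \gamma_{1,0}, \ldots, \gamma_{\ell-1,0}),\; g^{(1)}(\gamma_{0,1}, \gamma_{1,1}, \ldots, \gamma_{\ell-1,1}),\; \ldots,\; g^{(1)}(\gamma_{0,\ell^{m-1}-1}, \gamma_{1,\ell^{m-1}-1}, \ldots, \gamma_{\ell-1,\ell^{m-1}-1}) \Big],$$

where

$$\gamma_{i,j} = g_j^{(m-1)}\big(\mathbf{u}_{i\cdot\ell^{m-1}}^{(i+1)\cdot\ell^{m-1}-1}\big), \qquad 0 \le i \le \ell-1, \quad 0 \le j \le \ell^{m-1}-1.$$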
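To make the recursion of Definition 1 concrete, here is a minimal Python sketch. It assumes that the kernel is supplied as a function `g` mapping a tuple of $\ell$ symbols to a tuple of $\ell$ symbols, and that `len(u)` is a power of $\ell$. The names `polar_transform` and `arikan_kernel` are ours and are only illustrative.

```python
# Illustrative sketch of Definition 1; function names and data layout are assumptions.
def polar_transform(g, ell, u):
    """Return g^(m)(u) for len(u) == ell**m (m >= 1), following Definition 1."""
    n = len(u)
    if n == ell:  # m = 1: g^(1) = g
        return list(g(tuple(u)))
    sub = n // ell  # ell^(m-1)
    # Row i of gamma is the outer codeword g^(m-1)(u_{i*sub}^{(i+1)*sub-1}),
    # so gamma[i][j] is gamma_{i,j} of Definition 1.
    gamma = [polar_transform(g, ell, u[i * sub:(i + 1) * sub]) for i in range(ell)]
    # Apply the inner mapping g^(1) to every column j of the ell x sub matrix.
    x = []
    for j in range(sub):
        x.extend(g(tuple(gamma[i][j] for i in range(ell))))
    return x


def arikan_kernel(u):
    """Arikan's (u + v, v) kernel over GF(2)."""
    return ((u[0] + u[1]) % 2, u[1])


print(polar_transform(arikan_kernel, 2, [1, 0, 0, 0]))  # [1, 0, 0, 0]
print(polar_transform(arikan_kernel, 2, [0, 0, 0, 1]))  # [1, 1, 1, 1]
```

The same function works unchanged for any kernel dimension. For example, with a bijective $3 \times 3$ binary kernel it produces a bijection on $\{0,1\}^9$.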

General Concatenated Codes (GCCs)¹ are error-correcting codes generated by a construction technique that was introduced by Blokh and Zyablov [18] and Zinoviev [19]. In this construction, we have $\ell$ outer codes $\{C_r\}_{r=0}^{\ell-1}$, where $C_r$ is a code of length $N_{out}$ and size $M_r$ over the alphabet $F_r$. We also have an inner code of length $N_{in}$ and size $\prod_{r=0}^{\ell-1}|F_r|$ over the alphabet $F$, with a nested encoding function $\phi : F_0 \times F_1 \times \cdots \times F_{\ell-1} \to F^{N_{in}}$. The GCC generated by these components is a code of length $N_{out} \cdot N_{in}$ and of size $\prod_{r=0}^{\ell-1} M_r$. It is created by taking an $\ell \times N_{out}$ matrix, in which the $r$-th row is a codeword from $C_r$, and applying the inner mapping $\phi$ to each of the $N_{out}$ columns of the matrix. As Dumer describes in his survey [20], GCCs can give good code parameters for short lengths due to a good combination of outer codes and a nested inner code. In fact, some of them give the best parameters known. Moreover, their decoding algorithms commonly utilize this structure by performing local decoding steps on the (short) outer codes and exchanging decisions via the inner-code decoding.

As Arikan already noted, polar codes are examples of recursive GCCs [U Section ID]. This observation is useful, as it allows us to formalize the construction of a long polar code as a concatenation of several shorter polar codes (outer codes) by means of a kernel mapping (an inner code). Applying this notion to Definition 1, we see that a polar code of length $\ell^m$ may be regarded as a collection of $\ell$ outer polar codes of length $\ell^{m-1}$. These codes are then joined together by applying an inner code (defined by the mapping $g^{(1)}(\cdot)$) to the outputs of these mappings. This idea is illustrated in Figure 1. In this figure, the $\ell$ outer codewords of length $\ell^{m-1}$ are organized in the $\ell$ rows of a matrix. The inner mapping is depicted as the vertical rectangle located on top of them. This is appropriate, as this mapping operates on the columns of the matrix whose rows are the outer codewords. Note that, for brevity, we only drew one instance of the inner mapping, but there should be $\ell^{m-1}$ instances of it, one for each column of this matrix. In the homogeneous case, the outer codes themselves are constructed in the same manner. Although the outer codes have the same structure, they differ in general, because they may have different sets of frozen bits.

Figure 1: GCC representation of a homogeneous kernel according to Definition 1

Figure 2: Arikan's Construction

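Example 1 (Arikan's Construction) Let $\mathbf{u}$ be a binary vector of length $N = 2^m$. The vector $\mathbf{u}$ is transformed to a length-$N$ vector $\mathbf{x}$ by a bijective mapping $g^{(m)}(\cdot) : \{0,1\}^N \to \{0,1\}^N$. The transformation is defined recursively as

$$\text{for } m = 1: \quad g^{(1)}(\mathbf{u}) = [u_0 + u_1,\; u_1],$$

$$\text{for } m > 1: \quad g^{(m)}(\mathbf{u}) = [v_0, w_0, v_1, w_1, \ldots, v_{N/2-1}, w_{N/2-1}] = \mathbf{x}, \qquad (1)$$

where $\mathbf{v}_0^{N/2-1} = g^{(m-1)}(\mathbf{u}_0^{N/2-1}) + g^{(m-1)}(\mathbf{u}_{N/2}^{N-1})$ and $\mathbf{w}_0^{N/2-1} = g^{(m-1)}(\mathbf{u}_{N/2}^{N-1})$. See also Figure 2.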
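Equation (1) can be checked numerically. The sketch below implements (1) directly and compares the result with $\mathbf{u}\,G_2^{\otimes m}$ whose output indices are bit-reversed. Because the bit-reversal permutation commutes with $G_2^{\otimes m}$, this is the same as permuting $\mathbf{u}$ before the multiplication, as described in the Introduction. The helper names `arikan_encode` and `kron_form` are hypothetical.

```python
# Illustrative check of Eq. (1); helper names are assumptions, not from the paper.
def arikan_encode(u):
    """g^(m)(u) of Eq. (1) for a binary list u of length N = 2**m, m >= 1."""
    if len(u) == 2:  # m = 1: [u_0 + u_1, u_1]
        return [u[0] ^ u[1], u[1]]
    half = len(u) // 2
    a = arikan_encode(u[:half])  # g^(m-1)(u_0^{N/2-1})
    w = arikan_encode(u[half:])  # g^(m-1)(u_{N/2}^{N-1}) = w
    v = [aj ^ wj for aj, wj in zip(a, w)]
    x = []
    for vj, wj in zip(v, w):  # interleave: [v_0, w_0, v_1, w_1, ...]
        x += [vj, wj]
    return x


def kron_form(u):
    """u * G_2^{(x)m} over GF(2), with the output indices bit-reversed."""
    n = len(u)
    m = n.bit_length() - 1
    # Entry (i, k) of G_2^{(x)m} is 1 iff every bit set in k is also set in i.
    c = [0] * n
    for k in range(n):
        for i in range(n):
            if (i & k) == k:
                c[k] ^= u[i]

    def rev(k):  # reverse the m-bit binary representation of k
        return int(format(k, f"0{m}b")[::-1], 2)

    return [c[rev(k)] for k in range(n)]


u = [1, 0, 1, 1]
print(arikan_encode(u), kron_form(u))  # [1, 0, 1, 1] [1, 0, 1, 1]
```

The double loop in `kron_form` costs $O(N^2)$ and is only a reference check. The recursive form (1) needs $O(N \log N)$ operations.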

In mixed-kernel constructions, the outer codes are not necessarily from the same family of polar codes. For example, take the first kernel $g_1(u_0, u_{1,2}, u_3) = \mathbf{x}_0^3 \in \{0,1\}^4$, whose input $u_{1,2}$ is a glued pair of bits, and define the RS kernel as $g_2(u_{0,1}, u_{2,3}, u_{4,5}, u_{6,7}) = \mathbf{x}_0^3 \in (\{0,1\}^2)^4$. The resulting general concatenated construction is given in Figure 3.

Figure 3: Mixed Kernel as GCC

Now, note that using the RS kernel $g_2(\cdot)$ over a binary channel is like using a concatenated scheme in which the inner code is the standard binary full-space mapping. It can be observed that the mapping in Figure 3 has more potential for transforming between the alphabets in use. This concept may be further generalized by replacing some of the outer polar codes with other types of codes (see e.g. Trifonov's proposal [8]).

¹ The construction of GCCs is a generalization of Forney's code concatenation method [17].

The recursive GCC structure of polar codes calls for a recursive formalization of the algorithms associated with them. Such algorithms enjoy a simple and clear description, which may lead to an elegant analysis. Furthermore, in some cases this formalization allows resources to be reused and indicates which operations may be done in parallel. The recursive encoding algorithm has already been described in Definition 1. The recursive decoding algorithms are described in the next section.



3 Recursive Descriptions for Decoding Algorithms of Polar Codes 

In this section, we describe decoding algorithms for polar codes in a recursive framework induced by their recursive GCC structure. Roughly speaking, all the algorithms we consider here have a similar format. Consider the GCC structure of Definition 1. This means that we have a code of length $N$ composed of $\ell$ layers of outer codes, denoted by $\{C_r\}_{r=0}^{\ell-1}$, each one of length $N/\ell$. The decoding algorithms we consider here are composed of $\ell$ pairs of steps. Pair number $r$ ($1 \le r \le \ell$) is dedicated to decoding $C_{r-1}$, in the following way.

STEP 2·r - 1

Using the previous steps, prepare the inputs to the decoder of code $C_{r-1}$.

STEP 2·r

Call the decoder of code $C_{r-1}$ on the prepared input. Process the output of this decoder together with the outputs of the previous steps.

Typically, the codes $\{C_r\}_{r=0}^{\ell-1}$ are polar codes of length $N/\ell$, thereby creating the recursive structure of the decoding algorithm.

It should be noted that the above decoding format is quite common for decoding algorithms of GCCs; see, for example, the decoding algorithms in Dumer's survey on GCCs [20]. In addition, recursive decoding algorithms for Reed-Muller (RM) codes that utilize their Plotkin (u + v, v) recursive GCC structure were extensively studied by Dumer [21], [22], and they are closely related to the algorithms we present here. In fact, Dumer's simplified decoding algorithm for RM codes [22, Section IV] is the SC decoder for Arikan's structure that we describe in Subsection 3.1.

The algorithms we describe in a recursive fashion are SC (Subsection 3.1), Tal and Vardy's SCL (Subsection 3.2) and BP (Subsection 3.3). For all of these algorithms, we first consider Arikan's (u + v, v) code. For the first two algorithms we also provide generalizations to other kernels, both homogeneous and mixed. We note that, when possible, we prefer that the inputs to the algorithm and the internal computations be interpreted as log-likelihood ratios (llrs). Thus, the SC and BP algorithms are described in this manner, whereas in SCL decoding we use likelihoods instead of llrs. Furthermore, in our discussion we do not consider how to compute these quantities efficiently. In some cases, especially with large kernels or large alphabets, these computations pose a computational challenge. Approaches to address this challenge include efficient decoding algorithms (such as variants of the Viterbi algorithm) or approximations of the computations (for example, the min-sum approximation that Leroux et al. used [15], or the near Maximum Likelihood (ML) decoding algorithms that were used by Trifonov [8]).



3.1 A Recursive Description of the SC Algorithm 

We begin by considering the SC decoder for Arikan's (u + v, v) construction, and then generalize it to arbitrary kernels. First, let us describe the decoding algorithm for a code of length $N = 2$, i.e., for the basic kernel $g(u, v) = (u + v, v) = (a, b)$. We get as input $[\lambda_a, \lambda_b]$, the log-likelihood ratios (llrs) of the channel outputs ($\lambda_a$ corresponds to the first output of the channel and $\lambda_b$ corresponds to the second output). The algorithm has four steps.

STEP I

Compute the llr of $u$: $L_u = 2\tanh^{-1}\big(\tanh(\lambda_a/2)\tanh(\lambda_b/2)\big)$.

STEP II

Decide on $u$ (denote the decision by $\hat u$).

STEP III

Compute the llr of $v$ given the estimate of $u$: $L_v = (-1)^{\hat u} \cdot \lambda_a + \lambda_b$.

STEP IV

Decide on $v$ (denote the decision by $\hat v$).

It should be noted that Steps II and IV may be performed based on the llrs computed in Steps I and III (i.e., by their signs), or by using additional side information (for example, if $u$ is frozen, then the decision is based on its known value).
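The four steps for $N = 2$ can be sketched in Python as follows. This is an illustration only. It assumes the llr convention $\lambda = \log(\Pr(y \mid 0)/\Pr(y \mid 1))$, which is the convention under which the Step I formula holds. It decides by sign (ties go to 0) and uses hypothetical function names. The argument of $\tanh^{-1}$ is clamped slightly inside $(-1, 1)$; this numerical safeguard is our addition and is not part of the description above.

```python
import math


# Illustrative sketch of the N = 2 SC decoder; names and the clamping are assumptions.
def boxplus(lam_a, lam_b):
    """STEP I: llr of u = a + b (mod 2) from the llrs of a and b."""
    t = math.tanh(lam_a / 2.0) * math.tanh(lam_b / 2.0)
    t = max(-1.0 + 1e-15, min(1.0 - 1e-15, t))  # keep atanh finite
    return 2.0 * math.atanh(t)


def sc_decode_length2(lam_a, lam_b, frozen_u=None, frozen_v=None):
    """SC decoding of the kernel (u, v) -> (a, b) = (u + v, v).

    lam_a, lam_b       -- llrs log(Pr(y|0) / Pr(y|1)) of the two channel outputs
    frozen_u, frozen_v -- known value of a frozen bit, or None if it is unfrozen
    Returns (u_hat, v_hat).
    """
    # STEPS I-II: skip the llr computation when u is frozen
    if frozen_u is not None:
        u_hat = frozen_u
    else:
        u_hat = 0 if boxplus(lam_a, lam_b) >= 0 else 1
    # STEPS III-IV: L_v = (-1)^u_hat * lam_a + lam_b, then decide
    if frozen_v is not None:
        v_hat = frozen_v
    else:
        L_v = (-1) ** u_hat * lam_a + lam_b
        v_hat = 0 if L_v >= 0 else 1
    return u_hat, v_hat


print(sc_decode_length2(2.0, -1.0))              # (1, 1)
print(sc_decode_length2(2.0, -1.0, frozen_u=0))  # (0, 0)
```

In the first call the channel favors $(a, b) = (0, 1)$, that is, $u = 1$ and $v = 1$. Freezing $u$ to 0 changes the Step III llr to $\lambda_a + \lambda_b = 1 > 0$, so $\hat v = 0$.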

Now, to describe an SC decoder of length $N = 2^m$, let us assume that we have already developed an SC decoder for polar codes of length $N/2$. We assume that the length-$N$ decoder gets as input the $N$ channel-output llrs $\{\lambda_i\}_{i=0}^{N-1}$ and the frozen-bit indices. The decoder outputs the estimate of the information (unfrozen) bits and the estimate of the codeword that was sent over the channel. For convenience, we assume that the estimate of the information word is a length-$N$ vector (denoted by $\hat{\mathbf{u}}$) that also includes the values of the frozen bits. A decoder for a polar code of length $N$ consists of the following steps.

STEP I

Partition the llr vector into pairs of consecutive llr values $\{(\lambda_{2i}, \lambda_{2i+1})\}_{i=0}^{N/2-1}$. Compute the llr input vector $\mathbf{L}_0^{N/2-1}$ for the first outer code such that

$$L_i = 2\tanh^{-1}\big(\tanh(\lambda_{2i}/2)\tanh(\lambda_{2i+1}/2)\big), \qquad 0 \le i \le N/2 - 1.$$

STEP II

Give the vector $\mathbf{L}_0^{N/2-1}$ as input to the polar code decoder of length $N/2$. Also provide the decoding algorithm with the indices of the frozen bits from the first half of the codeword (corresponding to the first outer code). Assume that the decoder outputs $\hat{\mathbf{u}}^{(0)}$ as the estimate of the information word and $\hat{\mathbf{x}}^{(0)}$ as the estimate of the first outer polar codeword of length $N/2$. Both of them are vectors of length $N/2$. Then, we can output $\hat{\mathbf{u}}^{(0)}$ as $\hat{\mathbf{u}}_0^{N/2-1}$ (the first half of the estimated information word).

STEP III

Using, again, the input llr pairs and $\hat{\mathbf{x}}^{(0)}$ as the estimate of the first outer polar codeword, prepare the llr input vector $\mathbf{L}_0^{N/2-1}$ for the second outer code, such that

$$L_i = (-1)^{\hat{x}_i^{(0)}} \cdot \lambda_{2i} + \lambda_{2i+1}, \qquad 0 \le i \le N/2 - 1.$$

STEP IV

Give the vector $\mathbf{L}_0^{N/2-1}$ as input to the polar code decoder of length $N/2$, together with the indices of the frozen bits from the second half of the codeword (corresponding to the second outer code). Assume that the decoder outputs $\hat{\mathbf{u}}^{(1)}$ as the estimate of the information word and $\hat{\mathbf{x}}^{(1)}$ as the estimate of the second outer polar codeword of length $N/2$. Then, we can output $\hat{\mathbf{u}}^{(1)}$ as $\hat{\mathbf{u}}_{N/2}^{N-1}$ (the second half of the estimated information word).

Construct the estimate of the codeword as follows:

$$\hat{\mathbf{x}} = \big[\hat x_0^{(0)} + \hat x_0^{(1)},\; \hat x_0^{(1)},\; \hat x_1^{(0)} + \hat x_1^{(1)},\; \hat x_1^{(1)},\; \ldots,\; \hat x_{N/2-1}^{(0)} + \hat x_{N/2-1}^{(1)},\; \hat x_{N/2-1}^{(1)}\big].$$
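A compact recursive sketch of this decoder is given below. It keeps the assumptions of the length-2 case: the llr convention $\log(\Pr(y \mid 0)/\Pr(y \mid 1))$, sign decisions, and hypothetical names. One deliberate simplification is that the recursion bottoms out at a length-1 code, i.e., a single bit decided by its sign or set to its frozen value. Applying Steps I-IV to $N = 2$ with this base case reproduces the four-step decoder above.

```python
import math


# Illustrative sketch of the recursive SC decoder; names and base case are assumptions.
def boxplus(lam_a, lam_b):
    """llr of the XOR of two bits, 2 atanh(tanh(a/2) tanh(b/2)), clamped."""
    t = math.tanh(lam_a / 2.0) * math.tanh(lam_b / 2.0)
    t = max(-1.0 + 1e-15, min(1.0 - 1e-15, t))
    return 2.0 * math.atanh(t)


def sc_decode(lam, frozen):
    """Recursive SC decoder for the (u + v, v) polar code of length N = 2**m.

    lam    -- N channel llrs, log(Pr(y|0) / Pr(y|1))
    frozen -- dict {index: value} of frozen bits, indices local to this code
    Returns (u_hat, x_hat), both lists of N bits.
    """
    n = len(lam)
    if n == 1:  # length-1 code: frozen value, or a decision by sign
        bit = frozen[0] if 0 in frozen else (0 if lam[0] >= 0 else 1)
        return [bit], [bit]
    half = n // 2
    # STEP I: llrs for the first outer code
    L = [boxplus(lam[2 * i], lam[2 * i + 1]) for i in range(half)]
    # STEP II: decode it with the frozen bits of the first half
    u0, x0 = sc_decode(L, {i: b for i, b in frozen.items() if i < half})
    # STEP III: llrs for the second outer code, given the estimate x0
    L = [(-1) ** x0[i] * lam[2 * i] + lam[2 * i + 1] for i in range(half)]
    # STEP IV: decode it with the frozen bits of the second half
    u1, x1 = sc_decode(L, {i - half: b for i, b in frozen.items() if i >= half})
    # Codeword estimate [x0_0 + x1_0, x1_0, x0_1 + x1_1, x1_1, ...], as in Eq. (1)
    x = []
    for a, b in zip(x0, x1):
        x += [a ^ b, b]
    return u0 + u1, x


lam = [4.0 if b == 0 else -4.0 for b in [0, 0, 0, 0, 1, 1, 1, 1]]
print(sc_decode(lam, {i: 0 for i in range(6)}))
# ([0, 0, 0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 1, 1, 1])
```

In the example, $u_0, \ldots, u_5$ are frozen to 0 and the observations are noiseless (llrs of $\pm 4$) for the codeword $[0,0,0,0,1,1,1,1]$. The decoder recovers $\hat{\mathbf{u}} = [0,0,0,0,0,0,1,1]$.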

Let us now generalize this decoding algorithm to the GCC scheme with a general kernel. In this case, for a code of length $N$, we have a mapping $g(\mathbf{u}) = \mathbf{x}$ of dimension $\ell$ over an alphabet $F$, i.e., $g(\cdot) : F^{\ell} \to F^{\ell}$. We also have at most $\ell$ outer codes $\{C_r\}$, each one of length $N/\ell$. We may have fewer than $\ell$ outer codes if some of the inputs are glued (which results in a mixed-kernel case). In this case, the outer code corresponding to the glued inputs is considered to be over a larger input alphabet. We assume that each outer code has a decoding algorithm associated with it. This decoding algorithm is assumed to get as input the "channel" observations on the outer-code symbols (usually manifested as probability matrices or llr vectors). If the outer code is a polar code, then this algorithm should also get the indices of the frozen bits of the outer code. We require that the algorithm output its estimate of the information vector and of the corresponding outer codeword. Assuming that we know the input symbols $\hat{\mathbf{u}}_0^{k-1}$, the llr vector $L(\cdot)$ corresponding to input number $k + 1$ of the transformation (i.e., to $u_k$) is computed according to the following rule.

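$$L(t) = \ln\left(\frac{\sum_{\mathbf{u}_{k+1}^{\ell-1}} R_g\big(\hat{\mathbf{u}}_0^{k-1}, t, \mathbf{u}_{k+1}^{\ell-1}\big)}{\sum_{\mathbf{u}_{k+1}^{\ell-1}} R_g\big(\hat{\mathbf{u}}_0^{k-1}, 0, \mathbf{u}_{k+1}^{\ell-1}\big)}\right), \qquad t \in F, \qquad (2)$$

where

$$R_g(\mathbf{u}) = \exp\left(\sum_{i=0}^{\ell-1} \lambda_i\big(g_i(\mathbf{u})\big)\right), \qquad (3)$$

which is the likelihood ratio associated with input $\mathbf{u}$ to the kernel $g(\cdot)$, and $\lambda_i(\cdot)$ is the llr associated with the $i$-th output of the kernel, $x_i$. Because $F$ may be non-binary, $\lambda(\cdot)$ and $L(\cdot)$ are functions that assign an llr to each symbol of $F$, that is, $\lambda_i(t) = \log\left(\frac{\Pr(\mathbf{Y} = \mathbf{y} \mid x_i = t)}{\Pr(\mathbf{Y} = \mathbf{y} \mid x_i = 0)}\right)$ for $t \in F$, where $\mathbf{Y}$ is assumed to be the vector of observations.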
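Equations (2) and (3) can be evaluated by brute force over the $|F|^{\ell-k-1}$ completions, which is adequate for small kernels. The sketch below is our illustration, not an optimized method. It takes $F = \{0, \ldots, q-1\}$ and represents each $\lambda_i(\cdot)$ as a list with $\lambda_i(0) = 0$. It also works in the log domain (log-sum-exp), which is mathematically identical to (2)-(3) but avoids overflow of $R_g$. For binary $F$, note that $\lambda_i(1) = \log(\Pr(\mathbf{y} \mid x_i = 1)/\Pr(\mathbf{y} \mid x_i = 0))$ is the negative of the llr convention used in the length-2 description.

```python
import itertools
import math


# Illustrative sketch of Eqs. (2)-(3); representation and names are assumptions.
def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def kernel_input_llr(g, lam, u_hat, q):
    """Eqs. (2)-(3): llr function L(t), t in F = {0, ..., q-1}, of input u_k, k = len(u_hat).

    g     -- kernel: tuple of ell input symbols -> tuple of ell output symbols
    lam   -- lam[i][t] = lambda_i(t) = log(Pr(y | x_i = t) / Pr(y | x_i = 0))
    u_hat -- the decided inputs u_0, ..., u_{k-1}
    Returns [L(0), L(1), ..., L(q-1)], where L(0) == 0.
    """
    ell, k = len(lam), len(u_hat)

    def log_R(u):  # log of R_g(u) in Eq. (3)
        x = g(u)
        return sum(lam[i][x[i]] for i in range(ell))

    log_sums = []
    for t in range(q):  # numerator of (2) for every t; t = 0 gives the denominator
        terms = [log_R(tuple(u_hat) + (t,) + rest)
                 for rest in itertools.product(range(q), repeat=ell - k - 1)]
        log_sums.append(log_sum_exp(terms))
    return [v - log_sums[0] for v in log_sums]


def arikan_kernel(u):
    """Arikan's (u + v, v) kernel over GF(2)."""
    return ((u[0] + u[1]) % 2, u[1])


# lambda_a = 1.3 and lambda_b = -0.4 in the log(Pr(y|0)/Pr(y|1)) convention
lam = [[0.0, -1.3], [0.0, 0.4]]
print(kernel_input_llr(arikan_kernel, lam, [], 2))   # L(1) is about 0.227
print(kernel_input_llr(arikan_kernel, lam, [1], 2))  # L(1) is about 1.7
```

With Arikan's kernel the first value equals $-L_u$ and the second equals $-L_v$ (for $\hat u = 1$) of the length-2 decoder. This agreement is expected, since only the sign convention differs.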

We now describe the SC decoding algorithm. As already mentioned, because of the structure of the code, the decoding algorithm is composed of pairs of steps, such that pair $r$ deals with outer code $r - 1$, where $1 \le r \le \ell$. As a preparation step, we partition the decoder's length-$N$ input llr vector $\lambda(\cdot)$ into $N/\ell$ vectors, each of length $\ell$, denoted by $\lambda^{(m)}(\cdot)$, such that

$$\lambda_i^{(m)}(\cdot) = \lambda_{m\cdot\ell + i}(\cdot), \qquad 0 \le m \le N/\ell - 1, \quad 0 \le i \le \ell - 1. \qquad (4)$$

The length-$\ell$ vector $\lambda^{(m)}(\cdot)$ is associated with the output symbols corresponding to the $m$-th symbol of the outer codes (transformed by the kernel $g(\cdot)$). We denote the information word produced by the decoder of the $m$-th outer code by $\hat{\mathbf{u}}^{(m)}$ and its corresponding codeword by $\hat{\mathbf{x}}^{(m)}$; both of them are of length $N/\ell$. The algorithm has the following pair of steps for each $1 \le r \le \ell$.






STEP 2·r - 1

Using the results on the outer codewords from the previous steps, i.e., $\hat{\mathbf{x}}^{(m)}$ for $0 \le m \le r - 2$, prepare the length-$N/\ell$ llr input vector $L(\cdot)$ for outer code number $r - 1$. To do that, for $0 \le j \le N/\ell - 1$, compute $L_j(\cdot)$ using (2), with $\big[\hat x_j^{(m)}\big]_{0 \le m \le r-2}$ as the estimated inputs to the transformation and $\lambda^{(j)}(\cdot)$ as the llrs of its outputs.

STEP 2·r

Give the llr vector $L(\cdot)$ as input to the decoder of outer code number $r - 1$. If this is a polar code decoder of length $N/\ell$, then also supply the indices of the frozen symbols in the range $[(r-1)\cdot N/\ell,\; r\cdot N/\ell - 1]$. The decoder outputs $\hat{\mathbf{u}}^{(r-1)}$ as the estimate of the information word and $\hat{\mathbf{x}}^{(r-1)}$ as the estimate of the outer codeword. Both of these vectors are of length $N/\ell$ symbols.

After step $2\cdot\ell$, the decoder outputs its estimate of the information word by concatenating the information parts generated by all the outer-code decoders, i.e., $\hat{\mathbf{u}} = [\hat{\mathbf{u}}^{(0)}, \hat{\mathbf{u}}^{(1)}, \ldots, \hat{\mathbf{u}}^{(\ell-1)}]$. The estimate of the codeword, $\hat{\mathbf{x}}$, is obtained by applying the transformation $g(\cdot)$ to the columns of the matrix whose rows are $\{\hat{\mathbf{x}}^{(m)}\}_{m=0}^{\ell-1}$, that is

$$\hat{\mathbf{x}}_{i\cdot\ell}^{(i+1)\cdot\ell-1} = g\big(\hat x_i^{(0)}, \hat x_i^{(1)}, \ldots, \hat x_i^{(\ell-1)}\big), \qquad 0 \le i \le N/\ell - 1. \qquad (5)$$

The base case of the recursion, i.e., the decoder for a polar code of length $N = \ell$, is a simple generalization of Arikan's SC decoder for the length $N = 2$ code. The idea is to successively estimate the input symbols of the transformation $g(\cdot)$ using (2). We decide on the symbol $u_i$ using the llr generated by (2), in which our previous decisions are taken as known values. If $u_i$ is frozen, we skip the calculation of (2) and decide on its known value.
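Putting (2), (4) and (5) together gives the following recursive sketch for a homogeneous kernel over $F = \{0, \ldots, q-1\}$. The following choices are our assumptions. Llr functions are lists with $\lambda_i(0) = 0$, as in (2). A hard decision takes the symbol $t$ that maximizes $L(t)$. Frozen symbols are passed as a dictionary of local indices. All function names are hypothetical. The encoder of Definition 1 is included only to generate test codewords.

```python
import itertools
import math


# Illustrative sketch of SC for a general homogeneous kernel; names, the argmax
# decision rule and the data layout are assumptions, not taken from the paper.
def input_llr(g, lam, u_hat, q):
    """Eq. (2) in the log domain: [L(0), ..., L(q-1)] for input u_k, k = len(u_hat)."""
    ell, k = len(lam), len(u_hat)
    sums = []
    for t in range(q):
        # log R_g(u) of Eq. (3) for every completion u_{k+1}^{ell-1}
        terms = [sum(lam[i][xi] for i, xi in enumerate(g(tuple(u_hat) + (t,) + rest)))
                 for rest in itertools.product(range(q), repeat=ell - k - 1)]
        top = max(terms)
        sums.append(top + math.log(sum(math.exp(v - top) for v in terms)))
    return [v - sums[0] for v in sums]


def sc_decode(g, ell, q, lam, frozen):
    """Recursive SC decoder; lam holds N = ell**m llr functions, frozen maps index -> value."""
    n = len(lam)
    if n == ell:  # base case: successively estimate u_0, ..., u_{ell-1}
        u_hat = []
        for k in range(ell):
            if k in frozen:
                u_hat.append(frozen[k])
            else:
                L = input_llr(g, lam, u_hat, q)
                u_hat.append(max(range(q), key=lambda t: L[t]))
        return u_hat, list(g(tuple(u_hat)))
    sub = n // ell
    cols = [lam[j * ell:(j + 1) * ell] for j in range(sub)]  # Eq. (4)
    u_hat, outer_x = [], []
    for r in range(1, ell + 1):
        # STEP 2r-1: llr vector for outer code r-1, Eq. (2) applied to every column
        L = [input_llr(g, cols[j], [outer_x[m][j] for m in range(r - 1)], q)
             for j in range(sub)]
        # STEP 2r: decode outer code r-1 with its own frozen symbols
        lo, hi = (r - 1) * sub, r * sub
        u_r, x_r = sc_decode(g, ell, q, L, {i - lo: v for i, v in frozen.items() if lo <= i < hi})
        u_hat += u_r
        outer_x.append(x_r)
    x_hat = []  # Eq. (5): apply g to the columns of the matrix of outer codewords
    for i in range(sub):
        x_hat += list(g(tuple(outer_x[m][i] for m in range(ell))))
    return u_hat, x_hat


def polar_transform(g, ell, u):
    """Encoder of Definition 1, used here only to produce test codewords."""
    if len(u) == ell:
        return list(g(tuple(u)))
    sub = len(u) // ell
    gam = [polar_transform(g, ell, u[i * sub:(i + 1) * sub]) for i in range(ell)]
    return [s for j in range(sub) for s in g(tuple(gam[i][j] for i in range(ell)))]
```

Because (2) enumerates all completions, each base-case decision costs $O(q^{\ell})$ kernel evaluations. That cost is exactly the computational challenge for large kernels or alphabets mentioned at the start of this section.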

In the case of a mixed-kernel construction, the generalization is very easy. Let us assume that we have glued the symbols $u_1$ and $u_2$ into a new symbol $u_{1,2} \in F^2$. In this case, we treat these two symbols as one entity, and consider the outer code associated with them, denoted by $C_{1,2}$, as a code of length $N/\ell$ over the alphabet $F^2$. The only change in the decoding algorithm is in the pair of steps corresponding to this "glued" outer code. For the first step of the pair, we need to compute the length-$N/\ell$ llr vector $L(\cdot, \cdot)$ that serves as input to the decoder of $C_{1,2}$. In this case, each llr function in the vector must be a function of both $u_1$ and $u_2$. Equation (2) is therefore updated accordingly:

$$L(t_1, t_2) = \ln\left(\frac{\sum_{\mathbf{u}_3^{\ell-1}} R_g\big(\hat u_0, t_1, t_2, \mathbf{u}_3^{\ell-1}\big)}{\sum_{\mathbf{u}_3^{\ell-1}} R_g\big(\hat u_0, 0, 0, \mathbf{u}_3^{\ell-1}\big)}\right), \qquad t_1, t_2 \in F. \qquad (6)$$

The second step of the pair remains unchanged.
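Equation (6) differs from (2) only in that the glued inputs are enumerated jointly. The hypothetical helper below computes the joint llr for `width` consecutive glued inputs, so `width == 1` gives (2) and `width == 2` gives (6). It uses the same representation assumptions as the earlier sketches ($F = \{0, \ldots, q-1\}$, $\lambda_i(0) = 0$, log-domain sums). A convenient sanity check is marginalization: summing $e^{L(t_1, t_2)}$ over $t_2$ and normalizing by $t_1 = 0$ recovers the single-input llr (2) of $u_1$.

```python
import itertools
import math


# Illustrative sketch of Eq. (6); the helper name and data layout are assumptions.
def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def glued_input_llr(g, lam, u_hat, width, q):
    """Joint llr of the `width` glued inputs that follow the decided inputs u_hat.

    width == 1 gives Eq. (2); width == 2 gives Eq. (6).
    lam[i][t] = lambda_i(t) as in Eq. (2); symbols are 0, ..., q-1.
    Returns {(t_1, ..., t_width): L(t_1, ..., t_width)}, relative to the all-zero tuple.
    """
    ell, k = len(lam), len(u_hat)

    def log_R(u):  # log of R_g(u), Eq. (3)
        x = g(u)
        return sum(lam[i][x[i]] for i in range(ell))

    out = {}
    for ts in itertools.product(range(q), repeat=width):
        out[ts] = log_sum_exp([log_R(tuple(u_hat) + ts + rest)
                               for rest in itertools.product(range(q), repeat=ell - k - width)])
    zero = out[(0,) * width]
    return {ts: v - zero for ts, v in out.items()}


# A binary kernel of dimension 4 (u times the Kronecker square of Arikan's G_2),
# with u_1 and u_2 glued into one symbol of F^2 after u_0 = 1 has been decided.
def g4(u):
    return (u[0] ^ u[1] ^ u[2] ^ u[3], u[1] ^ u[3], u[2] ^ u[3], u[3])


lam = [[0.0, 0.7], [0.0, -1.1], [0.0, 0.3], [0.0, 2.0]]
print(glued_input_llr(g4, lam, [1], 2, 2))
```

In this glued setting, the decoder of $C_{1,2}$ receives a length-$|F|^2$ llr function per symbol instead of a length-$|F|$ one. No other part of the decoder changes.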



3.2 A Recursive Description of the SCL Algorithm 

Tal and Vardy introduced an efficient SCL decoder [10]. We give here a recursive description of this algorithm. The algorithm needs to compare the likelihoods of different decoding possibilities. Therefore, we need to assume that the inputs to the algorithm, as well as its internal computations, are interpreted as likelihoods instead of llrs. Note that if the list is of size 1, then the formulation we give below reduces to the SC decoder described in the previous subsection (with the only difference that we use likelihoods instead of llrs to describe the computations).

We also note that here we only describe the algorithm that generates the list. At the end of the algorithm, the most likely element of the list should be given as output. If there is an outer CRC, only outputs that agree with the CRC should be considered. The likelihood normalization that Tal and Vardy used [10, Algorithm 14] to avoid floating-point underflow is also applicable here. These two issues and their generalizations are not discussed further in this paper.






The SCL decoding algorithm for a polar code of length $N$ with a list of size $M$ gets the following structures as input.

• Two likelihood matrices $\Pi^{(0)}$ and $\Pi^{(1)}$ of dimension $M \times N$, which represent $M$ arrays of conditional probability values (each array of length $N$); each array corresponds to an input option that the decoder should consider. We refer to these input options as models. The plurality of models exists because, at any given point in the list decoding algorithm, we allow $M$ options for past decisions on the symbols of the information word (these options form the list). Each of these options induces a different statistical model, in which we assume that the information sub-vector associated with it is the one that was transmitted. We have $\Pi^{(b)}_{i,j} = \Pr\big(Y_j^{(i)} = y_j^{(i)} \mid V_j = b\big)$, where $y_j^{(i)}$ is the measurement of the $j$-th channel $V_j \to Y_j$ under the $i$-th option in the list and $b \in \{0, 1\}$.

• A marker $\rho_{in}$ indicating how many rows in $\Pi^{(0)}$ and $\Pi^{(1)}$ are occupied. The algorithm supports decoding of $\rho_{in} \in [1, M]$ input models.

• The vector of the indices of the frozen bits.

The algorithm outputs the following structures.

• A matrix $U$ of dimension $M \times N$, which represents $M$ arrays of information values (each array of length $N$); this is the list of possible information words that the decoder estimated.

• A matrix $X$ of dimension $M \times N$, which represents $M$ arrays of codewords (each array of length $N$); this is the list of codewords that correspond to the information words in $U$.

• An indicator vector $\mathbf{s}_0^{M-1}$ that indicates, for each row of $U$ and $X$, the row of the inputs $\Pi^{(0)}$ and $\Pi^{(1)}$ from which it originated (i.e., the statistical model that was assumed when estimating this row).

• A marker $\rho_{out}$ indicating how many rows of $U$ and $X$ are occupied.

For the basic case of length $N = 2$, the algorithm operates as follows.



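STEP I

For each of the $\rho_{in}$ occupied rows of $\Pi^{(0)}$ and $\Pi^{(1)}$ compute

$$P_i^{(0)} = \tfrac{1}{2}\big(\Pi^{(0)}_{i,0}\cdot\Pi^{(0)}_{i,1} + \Pi^{(1)}_{i,0}\cdot\Pi^{(1)}_{i,1}\big) \quad\text{and}\quad P_i^{(1)} = \tfrac{1}{2}\big(\Pi^{(1)}_{i,0}\cdot\Pi^{(0)}_{i,1} + \Pi^{(0)}_{i,0}\cdot\Pi^{(1)}_{i,1}\big), \qquad 0 \le i \le \rho_{in} - 1.$$

STEP II

Concatenate the two vectors into one vector of length $2\cdot\rho_{in}$, $\mathbf{P} = [\mathbf{P}^{(0)}, \mathbf{P}^{(1)}]$. Let $\tilde{\mathbf{P}}$ be a vector that contains the $\rho = \min\{2\cdot\rho_{in}, M\}$ largest values of $\mathbf{P}$. Let $\mathbf{s}^{(0)}$ and $\mathbf{u}^{(0)}$ be column vectors of length $\rho$ corresponding to $\tilde{\mathbf{P}}$, such that the $i$-th element of $\tilde{\mathbf{P}}$ is element number $s_i^{(0)}$ of the vector $\mathbf{P}^{(u_i^{(0)})}$. This element originated from model number $s_i^{(0)}$, which means that it was computed assuming that row number $s_i^{(0)}$ of $\Pi^{(0)}$ and $\Pi^{(1)}$ was the statistical model.

REMARK: If $u$ is frozen (without loss of generality, assume that it is set to the value 0), then Steps I and II should be skipped, and $\mathbf{s}^{(0)} = [0, 1, \ldots, \rho_{in} - 1]$, $\mathbf{u}^{(0)} = \mathbf{0}$, $\rho = \rho_{in}$.

STEP III

Generate two vectors of length $\rho$, $\mathbf{P}^{(0)}$ and $\mathbf{P}^{(1)}$. For each of the $\rho$ occupied rows of $\mathbf{s}^{(0)}$, compute ($i \in [0, \rho - 1]$)

$$P_i^{(0)} = \frac{1}{2}\begin{cases} \Pi^{(0)}_{s_i^{(0)},0}\cdot\Pi^{(0)}_{s_i^{(0)},1}, & u_i^{(0)} = 0,\\ \Pi^{(1)}_{s_i^{(0)},0}\cdot\Pi^{(0)}_{s_i^{(0)},1}, & u_i^{(0)} = 1, \end{cases} \qquad (7)$$

$$P_i^{(1)} = \frac{1}{2}\begin{cases} \Pi^{(1)}_{s_i^{(0)},0}\cdot\Pi^{(1)}_{s_i^{(0)},1}, & u_i^{(0)} = 0,\\ \Pi^{(0)}_{s_i^{(0)},0}\cdot\Pi^{(1)}_{s_i^{(0)},1}, & u_i^{(0)} = 1. \end{cases} \qquad (8)$$

STEP IV

Concatenate the two vectors into one vector of length $2\cdot\rho$, $\mathbf{P} = [\mathbf{P}^{(0)}, \mathbf{P}^{(1)}]$. Let $\tilde{\mathbf{P}}$ be a vector that contains the $\rho_{out} = \min\{2\cdot\rho, M\}$ largest values of $\mathbf{P}$. Let $\mathbf{s}^{(1)}$ and $\mathbf{u}^{(1)}$ be column vectors of length $\rho_{out}$ corresponding to $\tilde{\mathbf{P}}$, such that the $i$-th element of $\tilde{\mathbf{P}}$ is element number $s_i^{(1)}$ of the vector $\mathbf{P}^{(u_i^{(1)})}$.

REMARK: If the second bit is frozen (without loss of generality, assume that it is set to the value 0), then Steps III and IV should be skipped, and $\mathbf{s}^{(1)} = [0, 1, \ldots, \rho - 1]$, $\mathbf{u}^{(1)} = \mathbf{0}$, $\rho_{out} = \rho$.

Output ($i \in [0, \rho_{out} - 1]$):

• $\rho_{out}$

• $s_i = s^{(0)}_{a(i)}$, where $a(i) = s_i^{(1)}$

• $U_i = \big[u^{(0)}_{a(i)},\; u^{(1)}_i\big]$

• $X_i = \big[u^{(0)}_{a(i)} + u^{(1)}_i,\; u^{(1)}_i\big]$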
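The length-2 list decoder can be sketched directly from Steps I-IV. In this illustration, the names are hypothetical and the matrices are Python lists holding only the occupied rows, so $\rho_{in}$ and $\rho_{out}$ are simply list lengths. Frozen bits are fixed to 0 as in the REMARKs. Ties between equal candidate values are broken by Python's stable sort, a detail the description leaves open.

```python
# Illustrative sketch of the N = 2 SCL steps; names and data layout are assumptions.
def select_best(P0, P1, M):
    """STEPS II / IV: keep the min(2 * rho, M) largest entries of [P0, P1].

    Returns (s, bits): survivor i is element s[i] of P^(bits[i]).
    """
    cand = [(p, i, 0) for i, p in enumerate(P0)] + [(p, i, 1) for i, p in enumerate(P1)]
    cand = sorted(cand, key=lambda c: c[0], reverse=True)[:M]
    return [c[1] for c in cand], [c[2] for c in cand]


def scl_length2(Pi0, Pi1, M, frozen=()):
    """SCL decoding of the length-2 kernel (u, v) -> (u + v, v), STEPS I-IV.

    Pi0, Pi1 -- the rho_in occupied rows [Pr(y_0 | b), Pr(y_1 | b)] for b = 0, 1
    frozen   -- subset of {0, 1}; frozen bits are set to 0 as in the REMARKs
    Returns (s, U, X); rho_out is len(U).
    """
    rho_in, Pi = len(Pi0), (Pi0, Pi1)
    if 0 in frozen:  # REMARK after STEP II
        s0, u0 = list(range(rho_in)), [0] * rho_in
    else:  # STEPS I-II
        P0 = [0.5 * (Pi0[i][0] * Pi0[i][1] + Pi1[i][0] * Pi1[i][1]) for i in range(rho_in)]
        P1 = [0.5 * (Pi1[i][0] * Pi0[i][1] + Pi0[i][0] * Pi1[i][1]) for i in range(rho_in)]
        s0, u0 = select_best(P0, P1, M)
    rho = len(s0)
    if 1 in frozen:  # REMARK after STEP IV
        s1, u1 = list(range(rho)), [0] * rho
    else:  # STEPS III-IV, Eqs. (7)-(8): Pi^{(u+b)}_{s,0} * Pi^{(b)}_{s,1}
        P0 = [0.5 * Pi[u0[i]][s0[i]][0] * Pi0[s0[i]][1] for i in range(rho)]
        P1 = [0.5 * Pi[1 - u0[i]][s0[i]][0] * Pi1[s0[i]][1] for i in range(rho)]
        s1, u1 = select_best(P0, P1, M)
    s = [s0[a] for a in s1]  # s_i = s^(0)_{a(i)}, a(i) = s^(1)_i
    U = [[u0[a], u1[i]] for i, a in enumerate(s1)]
    X = [[u0[a] ^ u1[i], u1[i]] for i, a in enumerate(s1)]
    return s, U, X


s, U, X = scl_length2([[0.9, 0.1]], [[0.1, 0.9]], M=1)
print(U, X)  # [[1, 1]] [[0, 1]]
```

With $M = 1$ the procedure makes the same decisions as the SC decoder. With $M = 4$ and a single input model, all four $(u, v)$ pairs survive, and the most likely one is listed first.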

Now, to describe an SCL decoder for a polar code of length $N = 2^m$, let us assume that we have already developed an SCL decoder for polar codes of length $N/2$. A decoder for a polar code of length $N$ then consists of the following steps.

STEP I

Prepare the probability transition matrices for the decoder of the first outer polar code. Specifically, generate two matrices $P^{(b)}$ of dimension $M \times N/2$, $b \in \{0, 1\}$, such that for $0 \le i \le \rho_{in} - 1$ and $0 \le j \le N/2 - 1$

$$P^{(0)}_{i,j} = \tfrac{1}{2}\big(\Pi^{(0)}_{i,2j}\cdot\Pi^{(0)}_{i,2j+1} + \Pi^{(1)}_{i,2j}\cdot\Pi^{(1)}_{i,2j+1}\big) \qquad (9)$$

and

$$P^{(1)}_{i,j} = \tfrac{1}{2}\big(\Pi^{(1)}_{i,2j}\cdot\Pi^{(0)}_{i,2j+1} + \Pi^{(0)}_{i,2j}\cdot\Pi^{(1)}_{i,2j+1}\big). \qquad (10)$$

STEP II

Give the $M \times N/2$ matrices $P^{(0)}$ and $P^{(1)}$, the frozen bits from the first half of the codeword and $\rho_{in}$ (as the number of elements in the list) as inputs to the polar code decoder of length $N/2$. Assume that the decoder outputs $U^{(0)}$ and $X^{(0)}$ as the lists of estimates of the information word and of the outer polar codeword of length $N/2$, respectively. Both of these structures are matrices of dimension $M \times N/2$. The decoder also outputs $\mathbf{s}^{(0)}$ as the source indicator vector (of length $M$), and $\rho$ as the size of the list.

STEP III

Prepare the input matrices for the decoder of the second outer polar code of length $N/2$. Specifically, generate two matrices $P^{(b)}$ of dimension $M \times N/2$, $b \in \{0, 1\}$, such that for $0 \le i \le \rho - 1$ and $0 \le j \le N/2 - 1$

$$P^{(0)}_{i,j} = \frac{1}{2}\begin{cases} \Pi^{(0)}_{s_i^{(0)},2j}\cdot\Pi^{(0)}_{s_i^{(0)},2j+1}, & X^{(0)}_{i,j} = 0,\\ \Pi^{(1)}_{s_i^{(0)},2j}\cdot\Pi^{(0)}_{s_i^{(0)},2j+1}, & X^{(0)}_{i,j} = 1, \end{cases} \qquad (11)$$

and

$$P^{(1)}_{i,j} = \frac{1}{2}\begin{cases} \Pi^{(1)}_{s_i^{(0)},2j}\cdot\Pi^{(1)}_{s_i^{(0)},2j+1}, & X^{(0)}_{i,j} = 0,\\ \Pi^{(0)}_{s_i^{(0)},2j}\cdot\Pi^{(1)}_{s_i^{(0)},2j+1}, & X^{(0)}_{i,j} = 1. \end{cases} \qquad (12)$$

STEP IV

Give these matrices $P^{(0)}$ and $P^{(1)}$, the vector of indices of the frozen bits from the second half of the codeword and $\rho$ (as the number of elements in the list) as inputs to the decoder of the second outer polar code of length $N/2$. Assume that the decoder outputs $U^{(1)}$ and $X^{(1)}$ as the lists of estimates of the information words and of their corresponding outer polar codewords of length $N/2$. Both of these structures are matrices of dimension $M \times N/2$. The decoder also outputs $\mathbf{s}^{(1)}$ as the source indicator vector (of length $M$) and $\rho_{out}$ as the size of the output list.

Now, generate the outputs of the decoder ($i \in [0, \rho_{out} - 1]$):

• $s_i = s^{(0)}_{a(i)}$, where $a(i) = s_i^{(1)}$.

• $U_i = \big[U^{(0)}_{a(i)}, U^{(1)}_i\big]$.

• $X_{i,\mathrm{even}} = X^{(0)}_{a(i)} + X^{(1)}_i$ and $X_{i,\mathrm{odd}} = X^{(1)}_i$,

where $X_{i,\mathrm{even}}$ ($X_{i,\mathrm{odd}}$) is the vector of the even-indexed (odd-indexed) columns of row number $i$ of the matrix $X$, and, for a matrix $A$, the $i$-th row is denoted by $A_i$.
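A recursive sketch of Steps I-IV for $N = 2^m$ follows. As in the SC sketch, the recursion bottoms out at length 1. This base case is an assumption of the illustration: a single list-extension step that keeps the $\min\{2\rho, M\}$ largest likelihoods, which reproduces the length-2 procedure. No likelihood normalization is performed, so the sketch is only suitable for short codes. The helpers `arikan_encode` (for building test data) and `most_likely` (for picking the final output from the list) are hypothetical.

```python
# Illustrative sketch of the recursive SCL decoder; names, the length-1 base case and
# the list-of-rows data layout are assumptions, not taken from the paper.
def scl_decode(Pi0, Pi1, frozen, M):
    """Recursive SCL decoder for the (u + v, v) polar code of length N = 2**m.

    Pi0, Pi1 -- the rho_in occupied rows (input models) of N likelihoods Pr(y_j | V_j = b)
    frozen   -- dict {index: value} of frozen bits, indices local to this code
    M        -- maximal list size
    Returns (s, U, X): source model, information word and codeword of each list entry.
    """
    N = len(Pi0[0])
    if N == 1:  # length-1 code: extend every path by one bit, keep the M best
        rho = len(Pi0)
        if 0 in frozen:
            s, bits = list(range(rho)), [frozen[0]] * rho
        else:
            cand = [(Pi0[i][0], i, 0) for i in range(rho)] + [(Pi1[i][0], i, 1) for i in range(rho)]
            cand = sorted(cand, key=lambda c: c[0], reverse=True)[:M]
            s, bits = [c[1] for c in cand], [c[2] for c in cand]
        return s, [[b] for b in bits], [[b] for b in bits]
    half, Pi = N // 2, (Pi0, Pi1)
    # STEP I, Eqs. (9)-(10)
    P0 = [[0.5 * (r0[2*j] * r0[2*j+1] + r1[2*j] * r1[2*j+1]) for j in range(half)]
          for r0, r1 in zip(Pi0, Pi1)]
    P1 = [[0.5 * (r1[2*j] * r0[2*j+1] + r0[2*j] * r1[2*j+1]) for j in range(half)]
          for r0, r1 in zip(Pi0, Pi1)]
    # STEP II: first outer code with the frozen bits of the first half
    s0, U0, X0 = scl_decode(P0, P1, {i: b for i, b in frozen.items() if i < half}, M)
    # STEP III, Eqs. (11)-(12): Pi^{(x+b)}_{s,2j} * Pi^{(b)}_{s,2j+1}
    P0 = [[0.5 * Pi[x[j]][m][2*j] * Pi0[m][2*j+1] for j in range(half)]
          for m, x in zip(s0, X0)]
    P1 = [[0.5 * Pi[1 - x[j]][m][2*j] * Pi1[m][2*j+1] for j in range(half)]
          for m, x in zip(s0, X0)]
    # STEP IV: second outer code with the frozen bits of the second half
    s1, U1, X1 = scl_decode(P0, P1, {i - half: b for i, b in frozen.items() if i >= half}, M)
    # Outputs: path i extends row a = s1[i] of the list produced in STEP II
    s = [s0[a] for a in s1]
    U = [U0[a] + U1[i] for i, a in enumerate(s1)]
    X = [[bit for j in range(half) for bit in (X0[a][j] ^ X1[i][j], X1[i][j])]
         for i, a in enumerate(s1)]
    return s, U, X


def arikan_encode(u):
    """Eq. (1); used only to build test data."""
    if len(u) == 1:
        return list(u)
    half = len(u) // 2
    a, w = arikan_encode(u[:half]), arikan_encode(u[half:])
    return [bit for aj, wj in zip(a, w) for bit in (aj ^ wj, wj)]


def most_likely(Pi0, Pi1, s, X):
    """Index of the list entry with the largest channel likelihood."""
    def lik(i):
        val = 1.0
        for j, bit in enumerate(X[i]):
            val *= (Pi1 if bit else Pi0)[s[i]][j]
        return val
    return max(range(len(X)), key=lik)


u = [0, 0, 0, 0, 0, 1, 0, 1]
x = arikan_encode(u)
Pi0 = [[0.9 if b == 0 else 0.1 for b in x]]
Pi1 = [[0.1 if b == 0 else 0.9 for b in x]]
s, U, X = scl_decode(Pi0, Pi1, {i: 0 for i in range(5)}, M=8)
print(len(U), U[most_likely(Pi0, Pi1, s, X)])  # 8 [0, 0, 0, 0, 0, 1, 0, 1]
```

Suppose $M \ge 2^K$, where $K$ is the number of unfrozen bits (here $K = 3$). Then no path is ever discarded, so the list contains every codeword compatible with the frozen bits, and `most_likely` returns the ML decision. Smaller $M$ trades this guarantee for lower complexity.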

Let $T(n)$ be the decoding time complexity for a polar code of length $N = 2^n$. Then $T(n) = 2\cdot T(n-1) + O(M\cdot N)$ and $T(1) = O(M)$, which results in $T(n) = O(M\cdot N\cdot\log_2 N)$. Similarly, the space complexity of the algorithm can be shown to be $O(M\cdot N)$.
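For completeness, the recurrence can be unrolled explicitly. Here $c$ denotes the unspecified constant hidden in the $O(M\cdot N)$ term (only that constant is assumed), and $N = 2^n$:

```latex
\begin{aligned}
T(n) &= 2\,T(n-1) + c\,M\,2^{n}
      = 4\,T(n-2) + 2\,c\,M\,2^{n}
      = \cdots \\
     &= 2^{n-1}\,T(1) + (n-1)\,c\,M\,2^{n}
      = O(M\,N) + O(M\,N\log_2 N)
      = O(M\,N\log_2 N).
\end{aligned}
```

Each of the $n - 1$ recursion levels contributes $c\,M\,2^n$ in total, because it has twice as many subproblems as the level above, each half the size. The $2^{n-1}$ leaves contribute only $O(M\cdot N)$.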

The generalization of the decoding algorithm to a homogeneous kernel of dimension $\ell$ with alphabet $F$ is quite simple. Here we emphasize the principal changes from the (u + v, v) algorithm. First, the only change in the input is that we have input channel matrices $\Pi^{(b)}$, one for each $b \in F$. In the decoding algorithm, we have $\ell$ pairs of steps, each one dedicated to a different outer code. Before step $2\cdot r - 1$, we have decoded the outer codes $C_i$, $0 \le i \le r - 2$. We assume that we have temporary lists $X^{(i)}$ and $U^{(i)}$ of the estimated codewords and their corresponding information words, represented by matrices of size $M \times N/\ell$; the $i$-th matrix corresponds to the decoding of $C_i$, $0 \le i \le r - 2$. We maintain a temporary indicator vector $\mathbf{s}^{(0)}$ of length $M$, such that row $j$ of $X^{(i)}$ and $U^{(i)}$ was estimated assuming model $s_j^{(0)}$. We also have $\rho$ as the number of occupied elements in the list so far (at initialization, $\rho = \rho_{in}$).

\ 

STEP 2 · r − 1

Using the decoding results of the outer codewords from the previous steps, i.e. $X^{(m)}$ for
$0 \le m \le r - 2$, prepare the $N/\ell$ length likelihood lists $\{P^{(b)}\}_{b \in F}$. Each list is an $M \times N/\ell$
matrix, and all of them serve as inputs to the decoder of the $N/\ell$ length outer code
number $r - 1$. For the computation of row $i$ of $P^{(b)}$, use the input statistical model $s^{(0)}_i$,
that is, the likelihoods in rows $\{\Pi^{(b)}_{s^{(0)}_i}\}_{b \in F}$. Also, as the estimated codewords of the
previous outer codes, use rows $\{X^{(m)}_i\}_{0 \le m \le r-2}$. To prepare $\{P^{(b)}\}_{b \in F}$, we do
computations on likelihoods (instead of llrs), which are equivalent to step $2 \cdot r - 1$ in
the description of the general SC decoding (Subsection 3.1).

STEP 2 · r

Give the matrices $\{P^{(b)}\}_{b \in F}$, the vector of indices of the frozen bits of outer code number $r - 1$
and $\rho$ (as the number of elements in the list) as inputs to the decoder of
outer polar code number $r - 1$.

Assume that the decoder outputs $U^{(r-1)}$ and $X^{(r-1)}$ as the list of estimations of the
information words and their corresponding estimations of the transmitted codewords of
outer code number $r - 1$, respectively. Both of these structures are matrices of dimension
$M \times N/\ell$. The decoder also outputs $s^{(1)}$ as the model indicator vector (of length $M$) and
$\rho$ as the number of occupied elements in the list.

Allocate $s$, a temporary vector of size $M$, and temporary matrices $\tilde{X}^{(i)}$, $\tilde{U}^{(i)}$ of size
$M \times N/\ell$, where $0 \le i \le r - 2$.

• $s_j = s^{(0)}_{a(j)}$, where $a(j) = s^{(1)}_j$, $0 \le j \le \rho - 1$.
• $\tilde{X}^{(i)}_j = X^{(i)}_{a(j)}$, $0 \le i \le r - 2$, $0 \le j \le \rho - 1$.
• $\tilde{U}^{(i)}_j = U^{(i)}_{a(j)}$, $0 \le i \le r - 2$, $0 \le j \le \rho - 1$.

Copy these matrices to the internal data structures.

• $s^{(0)} = s$.

• $X^{(i)} = \tilde{X}^{(i)}$, $0 \le i \le r - 2$.

• $U^{(i)} = \tilde{U}^{(i)}$, $0 \le i \le r - 2$.

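If this is step $2 \cdot \ell$ (the last step), then prepare the output.

• $\rho_{out} = \rho$.

• $s^{(0)}$, as the output model indicator vector.

• $U = [U^{(0)}, U^{(1)}, \ldots, U^{(\ell-1)}]$.

• $X_{i,\, m\ell : (m+1)\ell - 1} = g\big(X^{(0)}_{i,m}, X^{(1)}_{i,m}, \ldots, X^{(\ell-1)}_{i,m}\big)$, $0 \le m \le N/\ell - 1$.

Here, for a matrix $A$, the subvector that is composed of the columns $n_1$ to $n_2$ of the $i$th row is denoted
by $A_{i, n_1 : n_2}$.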

The decoder for the basic $N = \ell$ length code also contains $\ell$ pairs of steps. The decoding is similar
to the above, except that instead of delivering the likelihood matrices $\{P^{(b)}\}_{b \in F}$ (here these
matrices are actually column vectors) to a decoder, we concatenate them into a vector $P$, choose the
$\rho = \min\{M, |F| \cdot \rho\}$ maximum elements from it, and generate the indicator vector and the information
symbols list $u^{(r-1)}$, similarly to the case of the $N = 2$ length decoder of the $(u+v, v)$ construction.
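The list-extension step of the basic decoder can be sketched as follows. This is an illustration under stated assumptions: the entries of the columns $P^{(b)}$ are taken to be directly comparable path likelihoods, the current symbol is not frozen, and `extend_list` is a hypothetical name.

```python
def extend_list(P, M):
    """Hypothetical sketch: keep the min(M, |F| * rho) most likely extensions.

    P maps every symbol b of the alphabet F to a column of rho likelihoods,
    entry i belonging to list row i. Returns the model indicator vector s
    (source row of each survivor), the decided symbols u and the new list size.
    """
    candidates = [(P[b][i], i, b) for b in P for i in range(len(P[b]))]
    candidates.sort(key=lambda c: c[0], reverse=True)   # most likely first
    kept = candidates[:min(M, len(candidates))]         # len(candidates) == |F| * rho
    s = [i for _, i, _ in kept]
    u = [b for _, _, b in kept]
    return s, u, len(kept)
```

With a binary alphabet and $\rho = 2$, at most four candidates exist, so a list capacity $M \ge 4$ keeps all of them.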

In case the kernel is mixed, the generalization is also quite easy. Let us consider the mixed example
from the end of Subsection 3.1. The only changes we have in the decoding algorithm are for the pair
of steps associated with the glued outer code $C_{1,2}$. In step 3 (the preparation step for this outer code),
we prepare $|F|^2$ input matrices $P^{(b_1,b_2)}$ for $(b_1, b_2) \in F^2$. For this, we use the equivalent of equation
(6) for likelihoods (instead of llrs). The decoder of $C_{1,2}$ is supposed to return a list of estimations of
the information words, their corresponding codewords and the model indicator vectors. These outputs
and the temporary structures are re-organized, as is done in step $2 \cdot r$ of the decoding algorithm of the
homogeneous kernel polar code. Note, however, that at the end of step 4, there are three information word
lists $U^{(0)}$, $U^{(1)}$ and $U^{(2)}$, along with their corresponding three outer codeword lists. This is because we
have decoded the glued outer code $C_{1,2}$ in a single step, which contributed $U^{(1)}$, $U^{(2)}$, $C^{(1)}$ and $C^{(2)}$ at
once.

Figure 4: Normal Factor Graph Representation of Arikan's Construction

3.3 A Recursive Description of the BP Algorithm 

BP is an alternative to SC decoding [1]. It is an iterative message-passing algorithm whose messages
are defined using Forney's normal factor graph [23]. It has not been established which algorithm is better for
general channels, except for the BEC, for which BP is shown to outperform SC [12]. However, simulations
indicate that BP outperforms SC in many cases. The order of sending the messages on the graph is called
the schedule of the algorithm. Hussami et al. suggested using a "Z shape schedule" for transferring the
messages [12, Section II.A]. Here we prefer to present a serial schedule, which is induced by the GCC
structure of the code.

We begin by describing the type of messages that are computed during the algorithm. Figure 4 depicts
the normal factor graph representation of Arikan's kernel. We have 4 symbol half edges, denoted by $u$, $v$, $x_0$
and $x_1$. These symbols have the following functional dependencies among them: $x_0 = u + v$ and $x_1 = v$.
The messages and the inputs that may be sent on the graph are assumed to be llrs, and their values are
taken from $\mathbb{R} \cup \{\pm\infty\}$. The values $\infty$ and $-\infty$ are special llrs that indicate known values of 0 and 1,
respectively. They are used to support the existence of the frozen bits of the polar code.

For the symbol half edges, we assume that we have 4 input llr messages. These messages may be
generated by the output of the channel, by known values associated with frozen bits, or by computations
that were done in this iteration or previous ones. We denote these messages by $\mu_u^{(in)}$, $\mu_v^{(in)}$, $\mu_{x_0}^{(in)}$ and $\mu_{x_1}^{(in)}$.
The algorithm computes (in due time) 4 output llr messages, $\mu_u^{(out)}$, $\mu_v^{(out)}$, $\mu_{x_0}^{(out)}$ and $\mu_{x_1}^{(out)}$, indicating
the estimations of $u$, $v$, $x_0$ and $x_1$, respectively, by the decoding algorithm. The messages are computed
using the extrinsic information principle, i.e. each message that is sent from a node on an adjacent edge is
a function of all the messages that were previously sent to the node, except the message that was received
over that particular edge. The nodes of the graph are denoted by $a_0$ (the adder functional) and $e_1$ (the
equality functional). Using the ideas mentioned above, we have the following computation rules.



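$$\mu_{e_1 \to a_0} = f^{(=)}\big(\mu_v^{(in)}, \mu_{x_1}^{(in)}\big), \tag{13}$$

$$\mu_{a_0 \to e_1} = f^{(+)}\big(\mu_u^{(in)}, \mu_{x_0}^{(in)}\big), \tag{14}$$

$$\mu_u^{(out)} = f^{(+)}\big(\mu_{x_0}^{(in)}, \mu_{e_1 \to a_0}\big), \tag{15}$$

$$\mu_v^{(out)} = f^{(=)}\big(\mu_{a_0 \to e_1}, \mu_{x_1}^{(in)}\big), \tag{16}$$

$$\mu_{x_0}^{(out)} = f^{(+)}\big(\mu_{e_1 \to a_0}, \mu_u^{(in)}\big), \tag{17}$$

$$\mu_{x_1}^{(out)} = f^{(=)}\big(\mu_{a_0 \to e_1}, \mu_v^{(in)}\big), \tag{18}$$

where $f^{(=)}(z_0, z_1) = z_0 + z_1$ and $f^{(+)}(z_0, z_1) = 2\tanh^{-1}\big(\tanh(z_0/2) \cdot \tanh(z_1/2)\big)$. Note that $\mu_{\alpha \to \beta}$, where
$\alpha, \beta \in \{e_1, a_0\}$, is the message that is sent from node $\alpha$ to node $\beta$. $\mu_u^{(out)}$ and $\mu_{x_0}^{(out)}$ are sent from $a_0$
over the half edges corresponding to symbols $u$ and $x_0$, respectively. $\mu_v^{(out)}$ and $\mu_{x_1}^{(out)}$ are sent from $e_1$
over the half edges corresponding to symbols $v$ and $x_1$, respectively.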
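The rules (13)-(18) for a single $(u+v, v)$ normal factor graph can be written directly as code. In this sketch (names hypothetical), $f^{(+)}$ is evaluated through the identity $2\tanh^{-1}(\tanh(a/2)\tanh(b/2)) = \mathrm{sign}(a)\,\mathrm{sign}(b)\min(|a|,|b|) + \log(1+e^{-|a+b|}) - \log(1+e^{-|a-b|})$, which is exact but avoids evaluating $\tanh^{-1}$ near $\pm 1$. Infinite inputs follow (20) and (21) below.

```python
import math

def f_eq(z0, z1):
    """Equality-node rule f(=): sum of llrs (an infinite input stays infinite)."""
    return z0 + z1

def f_plus(z0, z1):
    """Adder-node rule f(+) = 2*atanh(tanh(z0/2)*tanh(z1/2)).

    Infinite inputs follow rule (21); finite inputs use an algebraically
    equivalent, numerically stable form.
    """
    if math.isinf(z0):
        return math.copysign(1.0, z0) * z1
    if math.isinf(z1):
        return math.copysign(1.0, z1) * z0
    sign = math.copysign(1.0, z0) * math.copysign(1.0, z1)
    return (sign * min(abs(z0), abs(z1))
            + math.log1p(math.exp(-abs(z0 + z1)))
            - math.log1p(math.exp(-abs(z0 - z1))))

def kernel_messages(mu_u_in, mu_v_in, mu_x0_in, mu_x1_in):
    """Hypothetical helper: all messages of one (u+v, v) graph, rules (13)-(18)."""
    e1_to_a0 = f_eq(mu_v_in, mu_x1_in)                 # (13)
    a0_to_e1 = f_plus(mu_u_in, mu_x0_in)               # (14)
    return {
        "u_out": f_plus(mu_x0_in, e1_to_a0),           # (15)
        "v_out": f_eq(a0_to_e1, mu_x1_in),             # (16)
        "x0_out": f_plus(e1_to_a0, mu_u_in),           # (17)
        "x1_out": f_eq(a0_to_e1, mu_v_in),             # (18)
    }
```

For instance, a frozen $u = 0$ ($\mu_u^{(in)} = +\infty$) turns the adder node into a pass-through, so $\mu_{a_0 \to e_1} = \mu_{x_0}^{(in)}$ and $\mu_{x_0}^{(out)} = \mu_{e_1 \to a_0}$.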

We now turn to a recursive description of an iteration of the algorithm. The factor graph of
the length $N$ code has $\log_2 N$ layers. In each layer, there exist $N/2$ copies of the normal factor graph
that we depicted in Figure 4. Their organization can be inferred from the recursive description in Figure
2. Therefore, for each layer, we have $N/2$ realizations of the input messages, output messages and inner
messages (each one corresponding to a different set of symbols and interconnections). To denote the $i$th
realization of these messages, we use the notation $\mu_{\alpha \to \beta, i}$, $\mu_{\gamma, i}^{(in)}$ and $\mu_{\gamma, i}^{(out)}$, where $\alpha, \beta \in \{a_0, e_1\}$ and
$\gamma \in \{x_0, x_1, u, v\}$. As before, we denote the channel llrs by the length $N$ vector $\{\lambda_i\}_{i=0}^{N-1}$.

STEP I

Partition the llr vector into pairs of consecutive llr values $\big\{(\mu_{x_0,i}^{(in)}, \mu_{x_1,i}^{(in)}) = (\lambda_{2i}, \lambda_{2i+1})\big\}_{i=0}^{N/2-1}$.

Compute the messages $\{\mu_{e_1 \to a_0, i}\}_{i=0}^{N/2-1}$ using (13). Compute the messages $\{\mu_{u,i}^{(out)}\}_{i=0}^{N/2-1}$
using (15) (note that the two computations in this step can be combined into one computation).

STEP II

Give the vector $\{\mu_{u,i}^{(out)}\}_{i=0}^{N/2-1}$ as an input to the polar code BP iterative decoder of length
$N/2$. Also provide the indices of the frozen bits from the first half of the codeword. Assume
that the decoder outputs $\{\mu_{u,i}^{(in)}\}_{i=0}^{N/2-1}$ and the estimation of the information word.

STEP III

Compute the messages $\{\mu_{a_0 \to e_1, i}\}_{i=0}^{N/2-1}$ using (14). Compute the messages $\{\mu_{v,i}^{(out)}\}_{i=0}^{N/2-1}$
using (16) (note that the two computations in this step can be combined into one computation).

STEP IV

Give the vector $\{\mu_{v,i}^{(out)}\}_{i=0}^{N/2-1}$ as an input to the polar code decoder of length $N/2$. Also
provide to this decoder the indices of the frozen bits from the second half of the codeword.
Assume that the decoder outputs $\{\mu_{v,i}^{(in)}\}_{i=0}^{N/2-1}$ and the estimation of the information
word of the second outer polar code of length $N/2$.

The information part may be concatenated to the information part of step II, to generate
the decision on the information word after this iteration.

Compute the messages $\{\mu_{e_1 \to a_0, i}\}_{i=0}^{N/2-1}$ using (13).

Compute the messages $\{\mu_{x_0,i}^{(out)}\}_{i=0}^{N/2-1}$ and $\{\mu_{x_1,i}^{(out)}\}_{i=0}^{N/2-1}$ using (17) and (18), respectively.

Any input message or inner message, unless given (by the channel output or by prior knowledge of
the frozen bits), is set to 0 before the first iteration. It is assumed that the inner messages are preserved
between iterations (see also the further discussion in the sequel).

To complete the recursive description of the algorithm, we need to consider the case of the length
$N = 2$ code. Assume that we get $\mu_{x_0}^{(in)}, \mu_{x_1}^{(in)}$ as the input values. Also, before the first iteration, initialize
for $w \in \{u, v\}$

$$\mu_w^{(in)} = \begin{cases} 0, & w \text{ is not frozen;} \\ (-1)^b \cdot \infty, & w \text{ is frozen and equals } b. \end{cases} \tag{19}$$



STEP I

Compute $\mu_{e_1 \to a_0}$ according to (13).

STEP II

If $u$ is not frozen, compute $\mu_u^{(out)}$ according to (15), and make a hard decision on this bit
based on its sign.

STEP III

Compute $\mu_{a_0 \to e_1}$ according to (14).

STEP IV

If $v$ is not frozen, compute $\mu_v^{(out)}$ according to (16), and make a hard decision on it based
on its sign.

Compute $\mu_{x_0}^{(out)}$ and $\mu_{x_1}^{(out)}$ according to (17) and (18).

We should note that

$$f^{(=)}(\pm\infty, z_1) = f^{(=)}(z_0, \pm\infty) = \pm\infty, \tag{20}$$

$$f^{(+)}(\pm\infty, z_1) = \pm z_1, \qquad f^{(+)}(z_0, \pm\infty) = \pm z_0. \tag{21}$$

We further note that for the $N = 2$ length code, steps I and II can be combined into one operation, and similarly
steps III and IV can be combined into one operation. These two combined steps are independent, so
they may be performed in any order, or in parallel.
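Putting the recursive steps and the $N = 2$ case together gives the following minimal Python sketch of one iteration of this serial schedule. It assumes that all frozen bits equal 0 (so (19) assigns them $+\infty$) and that the channel llrs are finite. Per layer, it keeps only $\mu_v^{(in)}$ between calls. `BPNode`, `bp_iteration` and `encode` are hypothetical names, and the reference encoder uses the same interleaving as Steps I-IV ($x_{2k} = a_k + b_k$, $x_{2k+1} = b_k$).

```python
import math

def f_eq(z0, z1):
    """f(=): equality-node rule."""
    return z0 + z1

def f_plus(z0, z1):
    """f(+): rule (21) for infinite inputs, stable exact form otherwise."""
    if math.isinf(z0):
        return math.copysign(1.0, z0) * z1
    if math.isinf(z1):
        return math.copysign(1.0, z1) * z0
    s = math.copysign(1.0, z0) * math.copysign(1.0, z1)
    return (s * min(abs(z0), abs(z1)) + math.log1p(math.exp(-abs(z0 + z1)))
            - math.log1p(math.exp(-abs(z0 - z1))))

def hard(llr):
    return 0 if llr >= 0 else 1

class BPNode:
    """State of one length-n decoder (hypothetical layout); frozen bits assumed 0."""

    def __init__(self, frozen):
        self.n = len(frozen)
        self.frozen = frozen
        if self.n == 2:
            # Initialization (19): +inf for a frozen zero, 0 otherwise.
            self.mu_u_in = math.inf if frozen[0] else 0.0
            self.mu_v_in = math.inf if frozen[1] else 0.0
        else:
            half = self.n // 2
            self.mu_v_in = [0.0] * half      # the only messages kept across iterations
            self.first = BPNode(frozen[:half])
            self.second = BPNode(frozen[half:])

def bp_iteration(node, mu_x_in):
    """One serial-schedule iteration; returns (mu_x_out, hard decisions on u)."""
    if node.n == 2:
        x0, x1 = mu_x_in
        e1_a0 = f_eq(node.mu_v_in, x1)                              # Step I, (13)
        u_hat = 0 if node.frozen[0] else hard(f_plus(x0, e1_a0))   # Step II, (15)
        a0_e1 = f_plus(node.mu_u_in, x0)                            # Step III, (14)
        v_hat = 0 if node.frozen[1] else hard(f_eq(a0_e1, x1))     # Step IV, (16)
        out = [f_plus(e1_a0, node.mu_u_in), f_eq(a0_e1, node.mu_v_in)]  # (17), (18)
        return out, [u_hat, v_hat]
    half = node.n // 2
    lam0, lam1 = mu_x_in[0::2], mu_x_in[1::2]
    # Step I: (13) and (15) for every pair.
    e1_a0 = [f_eq(node.mu_v_in[i], lam1[i]) for i in range(half)]
    u_out = [f_plus(lam0[i], e1_a0[i]) for i in range(half)]
    # Step II: the first outer decoder returns mu_u_in.
    mu_u_in, u_first = bp_iteration(node.first, u_out)
    # Step III: (14) and (16).
    a0_e1 = [f_plus(mu_u_in[i], lam0[i]) for i in range(half)]
    v_out = [f_eq(a0_e1[i], lam1[i]) for i in range(half)]
    # Step IV: the second outer decoder returns mu_v_in, stored for the next iteration.
    node.mu_v_in, u_second = bp_iteration(node.second, v_out)
    e1_a0 = [f_eq(node.mu_v_in[i], lam1[i]) for i in range(half)]   # (13) again
    out = []
    for i in range(half):
        out += [f_plus(e1_a0[i], mu_u_in[i]),                         # (17)
                f_eq(a0_e1[i], node.mu_v_in[i])]                      # (18)
    return out, u_first + u_second

def encode(u):
    """Reference encoder: x[2k] = a[k] ^ b[k], x[2k+1] = b[k]."""
    if len(u) == 1:
        return list(u)
    a, b = encode(u[:len(u) // 2]), encode(u[len(u) // 2:])
    return [bit for k in range(len(a)) for bit in (a[k] ^ b[k], b[k])]
```

Building one `BPNode` per codeword and calling `bp_iteration` repeatedly on the same channel llrs realizes several iterations, since the stored $\mu_v^{(in)}$ values carry over from one call to the next.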

In this implementation, we assumed that there is a memory for storing the previously computed messages of type $\mu_u^{(in)}$, $\mu_v^{(in)}$,
$\mu_{x_0}^{(in)}$, $\mu_{x_1}^{(in)}$ and $\mu_{a_0 \to e_1}$. This memory is dedicated to each realization
of such messages, specifically, to each layer of the graph and to each $(u+v, v)$ normal subgraph, as
in Figure 4. Actually, for this particular schedule, excluding $\mu_v^{(in)}$, we do not need to save any message
beyond the iteration boundary (this observation reduces the required memory consumption, as we will see
in the hardware implementation). The memory consumption of the algorithm is $O(N \cdot \log N)$. The
running time is also $O(N \cdot \log N)$, assuming no parallelism is allowed.

In each iteration, we send one instance of each of the possible messages for each $(u+v, v)$ block
realization in the code, except for $\mu_{e_1 \to a_0}$, for which we send two messages (for all the layers besides
the last one). The full implementation may contain several iterations. The number of iterations may
be fixed or set adaptively, which means that the algorithm continues until some consistency criterion is
fulfilled. An example of such a criterion is that the signs of the llr estimations of all the frozen bits
agree with their known values (i.e. if all the frozen bits are set to zero, then $\mathrm{sign}(\mu^{(out)}) > 0$ for all the
frozen bits). In this case, one can stop an iteration in the middle by holding a counter, in a similar
way to the method that is usually used in BP decoding of LDPC codes with check-node based serial
schedules (see e.g. [H]). We note, however, that in the LDPC case, consistency is manifested in the
fact that all the parity check equations are satisfied.

In the next section we describe hardware architectures for the decoding algorithms we covered so far. 



4 Recursive Descriptions of Hardware Architectures of Decoders 
for Arikan's Construction 

We now turn to study hardware architectures that are inspired by the recursive decoding algorithms
which we presented in Section 3. This section covers hardware architectures for Arikan's $(u+v, v)$
construction. A generalization of this discussion to other kernels is presented in Section 5. We begin with
the simple SC pipeline decoder (Subsection 4.1), and then progress to the more efficient SC line decoder
(Subsection 4.2). Both of these designs were presented by Leroux et al. [Ml [15] in a non-recursive fashion.
We finish by considering a BP line decoder (Subsection 4.3).

Figure 5: Block Diagram for the SC Pipeline Decoder.

It is important to note that throughout the hardware discussion, our presentation is relatively abstract,
emphasizing the important concepts and features of the recursive designs without dwelling on all the
details. As such, the figures representing the block diagrams should not be considered as fully detailed
specifications of the implementation, but rather as illustrations that aim to aid the reader in the task of
designing the decoder.

We usually prefer to use the same notation for signal arrays and register arrays. Let $u(0 : N-1)$ be
an $N$ length signal array; then its $i$th value is denoted by $u(i)$. If $v$ is a two-dimensional array of $M$ rows
and $N$ columns, we denote it by $v(0 : M-1, 0 : N-1)$. Naturally, the $i$th row of this array is denoted
by $v(i, 0 : N-1)$, and it is a one-dimensional array of $N$ elements, of which the $j$th element is denoted
by $v(i, j)$.



4.1 The SC Pipeline Decoder 

A block diagram of the SC pipeline decoder for Arikan's construction is depicted in Figure 5. The main
ingredients of the diagram are listed below.

1. Processing Element (PE) - This is the basic computation unit of the decoder. It gets as input two
channel llrs, an estimate of the $u$ input for the $(u+v, v)$ mapping and a control signal, $c_u$, indicating
whether to compute the llr of $u$ ($c_u = 0$) or $v$ ($c_u = 1$). Note that the estimate of $u$ is only needed
in the latter case.

2. $\lambda(0 : N-1)$ - An array of $N$ registers holding the llrs from the channels.

3. SC decoding unit of polar code of length $N/2$ - This unit has the following inputs: an $N/2$ length
signal array of input llrs and a binary signal array containing the indices of the frozen bits of the
code. Its outputs are $\hat{u}(0 : N/2-1)$, which is the estimation of the transmitted information word
(including the frozen bits), and $\hat{x}(0 : N/2-1)$, which is the estimation of the transmitted codeword.

4. A register for the estimated information word $u(0 : N-1)$.

5. Encoding unit for generating the estimated codeword; it includes a register for the codeword $x(0 : N-1)$
and $N/2$ bitwise xor circuits for generating the codeword based on the output of the $N/2$
length decoder.

We note that a basic $N = 2$ length decoder has only one PE, and operates according to the algorithm
described in Section 2. The algorithm for $N > 2$ is based on recursion, as we describe
below.



STEP I

Using the processing elements $PE_0, PE_1, \ldots, PE_{N/2-1}$ with $c_u = 0$, prepare the llr input
for the decoder of the first $N/2$ length outer code and output it on the signal array
$L(0 : N/2-1)$, such that

$$L(k) = 2\tanh^{-1}\big(\tanh(\lambda(2k)/2) \cdot \tanh(\lambda(2k+1)/2)\big), \quad 0 \le k \le N/2-1.$$



STEP II

Give the signal array $L(0 : N/2-1)$ and the list of indices of the frozen bits corresponding to the first half
of the codeword (i.e. the first outer code) as inputs to the polar code decoder of length
$N/2$.

Call the decoder of the length $N/2$ polar code on these inputs (decoding the first outer polar
code).

Store $u(0 : N/2-1) = \hat{u}(0 : N/2-1)$, $x_{even}(0 : N/2-1) = \hat{x}(0 : N/2-1)$.
STEP III

Using the signal array $\hat{x}(0 : N/2-1)$ as the vector of estimations of $u$ from the $(u+v, v)$
pairs, operate the processing elements $PE_0, PE_1, \ldots, PE_{N/2-1}$ with $c_u = 1$. This
prepares the llr input for the second outer code, and outputs it on the signal array
$L(0 : N/2-1)$, such that

$$L(k) = (-1)^{\hat{x}(k)} \lambda(2k) + \lambda(2k+1), \quad 0 \le k \le N/2-1.$$



STEP IV

Give the signal array $L(0 : N/2-1)$ as an input to the polar code decoder of length
$N/2$. Also provide the indices of the frozen bits corresponding to the second half of the
codeword (i.e. the second outer code).

Call the decoder of the length $N/2$ polar code on these inputs (which means that we
decode the second outer polar code).

Store $u(N/2 : N-1) = \hat{u}(0 : N/2-1)$, $x_{even}(0 : N/2-1) = x_{even}(0 : N/2-1) + \hat{x}(0 : N/2-1)$,
$x_{odd}(0 : N/2-1) = \hat{x}(0 : N/2-1)$.

Here, for an array $x$, we denote by $x_{even}$ and $x_{odd}$ the 2-decimated arrays containing $x$'s even-indexed
samples and odd-indexed samples, respectively. Note that, to avoid any delays due to sampling by a
register, it is important that the codeword estimation (which is one of the outputs of the decoder) be
the output of the encoding layer and not of the register following it. This issue and further timing concerns
are considered in the next subsection.
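A software model can help check the recursion of Steps I-IV before committing to hardware. The following Python sketch is not a hardware description. It assumes that the frozen bits equal 0 and that a positive llr means 0, and it evaluates the $c_u = 0$ PE formula through an algebraically equivalent, numerically stable expression. `pe`, `sc_decode` and `encode` are hypothetical names.

```python
import math

def boxplus(a, b):
    """PE with c_u = 0: 2*atanh(tanh(a/2)*tanh(b/2)), in an equivalent stable form."""
    s = math.copysign(1.0, a) * math.copysign(1.0, b)
    return (s * min(abs(a), abs(b))
            + math.log1p(math.exp(-abs(a + b)))
            - math.log1p(math.exp(-abs(a - b))))

def pe(l0, l1, u_hat, c_u):
    """Processing element: llr of u (c_u == 0) or of v given u_hat (c_u == 1)."""
    if c_u == 0:
        return boxplus(l0, l1)
    return (-l0 if u_hat else l0) + l1

def hard(llr):
    """Hard decision: a non-negative llr means bit 0."""
    return 0 if llr >= 0 else 1

def sc_decode(lam, frozen):
    """Recursive SC decoder following Steps I-IV; frozen[i] means u(i) is frozen to 0.

    Returns (u_hat, x_hat): information word (frozen bits included) and codeword.
    """
    n = len(lam)
    if n == 2:                                   # basic decoder: one PE used twice
        u0 = 0 if frozen[0] else hard(pe(lam[0], lam[1], 0, 0))
        u1 = 0 if frozen[1] else hard(pe(lam[0], lam[1], u0, 1))
        return [u0, u1], [u0 ^ u1, u1]
    half = n // 2
    # Step I: PE_0 .. PE_{N/2-1} with c_u = 0.
    L = [pe(lam[2 * k], lam[2 * k + 1], 0, 0) for k in range(half)]
    # Step II: decode the first outer code.
    u_first, x_first = sc_decode(L, frozen[:half])
    # Step III: the same PEs with c_u = 1, using x_first as the estimates of u.
    L = [pe(lam[2 * k], lam[2 * k + 1], x_first[k], 1) for k in range(half)]
    # Step IV: decode the second outer code, then apply the encoding layer.
    u_second, x_second = sc_decode(L, frozen[half:])
    x = [0] * n
    for k in range(half):
        x[2 * k] = x_first[k] ^ x_second[k]      # x_even = x_even + x_hat
        x[2 * k + 1] = x_second[k]               # x_odd = x_hat
    return u_first + u_second, x

def encode(u):
    """Reference encoder with the same interleaving (x[2k] = a[k] + b[k], x[2k+1] = b[k])."""
    if len(u) == 1:
        return list(u)
    a, b = encode(u[:len(u) // 2]), encode(u[len(u) // 2:])
    return [bit for k in range(len(a)) for bit in (a[k] ^ b[k], b[k])]
```

With noiseless llrs derived from `encode(u)` (for example $\pm 4$), `sc_decode` returns both `u` and its codeword. In the hardware, the list comprehensions of Steps I and III correspond to the $N/2$ PEs firing in parallel.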

Let us consider the complexity of this circuit. We assume that a call to a PE finishes in one clock
cycle. Denote by $T(n)$ the time (in terms of the number of clock cycles) that is required to complete the
decoding of an $N = 2^n$ length polar code. Then $T(n) = 2 + 2 \cdot T(n-1)$ for $n > 1$ and $T(1) = 2$. This recursion
yields $T(n) = 2N - 2$. Denote by $P(n)$ the number of PEs for a decoder of a length $N = 2^n$ polar code; we
have $P(n) = 2^{n-1} + P(n-1)$ for $n > 1$ and $P(1) = 1$, so $P(n) = 2^n - 1 = N - 1$. The cost of the encoding
unit is $2 \cdot \sum_{i=1}^{n} 2^i = 4 \cdot (N-1)$ bit registers and $\sum_{i=0}^{n-1} 2^i = N - 1$ xor circuits. We should have $R(n)$
registers for holding llr values, so $R(n) = 2^n + R(n-1)$ for $n > 1$ and $R(1) = 2$, so $R(n) = 2 \cdot P(n) = 2N - 2$.
Note that in this design, we assume that the re-encoding unit is a combinatorial circuit.
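The recursions above can be checked numerically. This small sketch (hypothetical function names) evaluates $T(n)$, $P(n)$ and $R(n)$ by their recursions, together with the encoding-unit sums, so that they can be compared with the closed forms $2N - 2$, $N - 1$, $2N - 2$, $4(N-1)$ and $N - 1$.

```python
def pipeline_costs(n):
    """Evaluate the SC pipeline decoder recursions for N = 2**n (n >= 1)."""
    if n == 1:
        return {"T": 2, "P": 1, "R": 2}
    prev = pipeline_costs(n - 1)
    return {"T": 2 + 2 * prev["T"],          # Steps I and III plus two half-length calls
            "P": 2 ** (n - 1) + prev["P"],   # N/2 new PEs at this level
            "R": 2 ** n + prev["R"]}         # N new llr registers at this level

def encoder_costs(n):
    """Encoding unit: 2 * sum_{i=1}^{n} 2^i register bits and sum_{i=0}^{n-1} 2^i xors."""
    return 2 * sum(2 ** i for i in range(1, n + 1)), sum(2 ** i for i in range(n))
```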

4.2 The SC Line Decoder 

In the pipeline design of the decoder of length $N$, the $N/2$ processing elements $\{PE_k\}_{k=0}^{N/2-1}$ are only used
during steps I and III of the algorithm. During the other steps (which ideally consume $2 \cdot T(n-1) = 2N - 4$
clock cycles of the total $2N - 2$), these processors are idle, and this results in an inefficient design. To
improve this, we observe that the maximum number of operations that can be done in parallel by the PEs
in the SC decoding algorithm is $N/2$. So, in order to allow this maximum level of parallelism, a design
must have at least $N/2$ processors. The line decoder² that we describe in this subsection achieves this
lower bound. In order to support this, we need to redefine the decoder of a length $N$ polar code.
First, the line decoder has two operation modes.

Standard Mode (S-Mode) 

In this mode, the decoder gets as inputs llrs and the indices of the frozen bits, and outputs the hard 
decision on the information word and its corresponding codeword (this is the operation mode we 
assumed so far). 

PE-Array Mode (P-Mode) 

In this mode, the decoder gets as input a signal array of llrs $\lambda(0 : N-1)$, a control signal $c_u$, and a
binary array of length $N/2$, $x_{in}(0 : N/2-1)$. The output is a signal array $L(0 : N/2-1)$ of llrs, where
for $0 \le k \le N/2-1$

$$L(k) = \begin{cases} 2\tanh^{-1}\big(\tanh(\lambda(2k)/2) \cdot \tanh(\lambda(2k+1)/2)\big), & c_u = 0; \\ (-1)^{x_{in}(k)} \cdot \lambda(2k) + \lambda(2k+1), & c_u = 1. \end{cases} \tag{22}$$

In Figure 6 we give a block diagram of this decoder. Note that in order to maintain the maximum level
of parallelism, the length $N$ polar code decoder has $N/2$ processors. Thus, in order to build the length $N$
polar code decoder using an embedded $N/2$ length polar code decoder (already having $N/4$ processors),
we use an additional array of $N/4$ PEs, which is referred to as the auxiliary array. The input signal
$mode_{in}$ indicates whether the decoder is used in S-Mode or in P-Mode. The $mode$ signal is an internal
signal that controls whether the $N/2$ length embedded decoder is in P-Mode.
The algorithm for the S-Mode is described below.

STEP I 

Simultaneously,

• At the multiplexer array (MUX array) at the input of the embedded decoder of
length $N/2$ polar code, set the control signal $c_m = 0$, which means that the array
$\lambda(0 : N/2-1)$ is selected as an input to this unit. Set $c_u = 0$ and use the decoder of
length $N/2$ polar code in P-Mode, which causes this unit to output the signal array
($0 \le k \le N/4-1$)

$$L(k) = 2\tanh^{-1}\big(\tanh(\lambda(2k)/2) \cdot \tanh(\lambda(2k+1)/2)\big).$$

Store this array in the registers array $R(0 : N/4-1)$.



• Use the auxiliary array of processors and compute, for $N/4 \le k \le N/2-1$,

$$L(k) = 2\tanh^{-1}\big(\tanh(\lambda(2k)/2) \cdot \tanh(\lambda(2k+1)/2)\big),$$

and store them in the registers array $R(N/4 : N/2-1)$.

² We note that the original line decoder, which was presented by Leroux et al. [15, Section 3.3], is not precisely the same
design as the one we discuss here. The differences appear to be minor (especially in the area of the routing between the llr
registers and the PEs), so we preferred not to distinguish it from Leroux's design by giving it another name.

STEP II 

• At the MUX array at the input of the decoder of the length $N/2$ polar code, set the
control signal $c_m = 1$, which means that the content of the registers array $R(0 : N/2-1)$
is selected as an input to this unit.

• Provide the vector of indices corresponding to the frozen bits from the first half of
the codeword to the $N/2$ length decoder.

Call the decoder of the length $N/2$ polar code in S-Mode on these inputs (decoding
of the first outer polar code).

Store $u(0 : N/2-1) = \hat{u}(0 : N/2-1)$, $x_{even}(0 : N/2-1) = \hat{x}(0 : N/2-1)$.

STEP III 

Simultaneously,

• At the MUX array at the input of the embedded decoder of length $N/2$ polar code,
set the control signal $c_m = 0$, which means that the array $\lambda(0 : N/2-1)$ is selected
as an input to this unit.

Set $c_u = 1$ and use the decoder of length $N/2$ polar code in P-Mode, which causes
this unit to output the signal array ($0 \le k \le N/4-1$)

$$L(k) = (-1)^{\hat{x}(k)} \cdot \lambda(2k) + \lambda(2k+1).$$

Store this signal array in the registers array $R(0 : N/4-1)$. Note that we use
$\hat{x}(0 : N/4-1)$, the first $N/4$ entries of the estimated first outer codeword that the embedded
decoder gave as output in step II, as an input to this unit.

• Use the auxiliary array and compute, for $N/4 \le k \le N/2-1$,

$$L(k) = (-1)^{\hat{x}(k)} \cdot \lambda(2k) + \lambda(2k+1),$$

and store them in the registers array $R(N/4 : N/2-1)$.

STEP IV 

At the MUX array at the input of the decoder of the length $N/2$ polar code, set the
control signal $c_m = 1$, which means that the array of registers $R(0 : N/2-1)$ is selected
as an input to this unit.

Provide to the $N/2$ length decoder the vector of indices corresponding to the frozen bits
of the second half of the codeword.

Call the decoder of the length $N/2$ polar code in S-Mode on these inputs (decoding of the
second outer polar code).

Store $u(N/2 : N-1) = \hat{u}(0 : N/2-1)$, $x_{even}(0 : N/2-1) = x_{even}(0 : N/2-1) + \hat{x}(0 : N/2-1)$,
$x_{odd}(0 : N/2-1) = \hat{x}(0 : N/2-1)$.

The P-Mode operation of the decoder is quite simple. Use $c_m = 0$, which means that the channel
llrs are given as an input to the line decoder of the $N/2$ length polar code. Also provide as input the
vector $x_{in}(0 : N/2-1)$, which serves here as the estimations of the $u$ bits from the $(u+v, v)$ pairs. Set
$c_u = c_{u,in}$, i.e. forward the control signal given as input, and operate the auxiliary array of processors and the line decoder of length
$N/2$ in P-Mode simultaneously. Return the llr output of the line decoder of length $N/2$ and of the auxiliary array, i.e. the
signal array $L(0 : N/2-1)$.
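The P-Mode recursion can be sketched as follows. The embedded decoder receives $\lambda(0 : N/2-1)$ and produces $L(0 : N/4-1)$, while the auxiliary array produces $L(N/4 : N/2-1)$. The sketch uses hypothetical names and the exact PE formula, so it assumes finite llrs of moderate magnitude. It also counts PE activations, which illustrates that a single P-Mode call activates $N/2$ PEs.

```python
import math

def pe(l0, l1, x_hat, c_u):
    """One PE evaluating Eq. (22) for a single pair (exact formula)."""
    if c_u == 0:
        return 2.0 * math.atanh(math.tanh(l0 / 2.0) * math.tanh(l1 / 2.0))
    return (-l0 if x_hat else l0) + l1

def p_mode(lam, c_u, x_in, counter):
    """Hypothetical model of the P-Mode of a length-N line decoder.

    The embedded length-N/2 decoder gets lam(0 : N/2-1) and handles the pairs
    k < N/4; the auxiliary array handles N/4 <= k < N/2. counter[0] counts
    PE activations.
    """
    n = len(lam)
    if n == 2:                                  # basic decoder: a single PE
        counter[0] += 1
        return [pe(lam[0], lam[1], x_in[0], c_u)]
    quarter = n // 4
    inner = p_mode(lam[: n // 2], c_u, x_in[:quarter], counter)   # embedded decoder
    aux = []
    for k in range(quarter, n // 2):            # auxiliary array of N/4 PEs
        counter[0] += 1
        aux.append(pe(lam[2 * k], lam[2 * k + 1], x_in[k], c_u))
    return inner + aux
```

The result is the same array that a flat row of $N/2$ PEs would produce; the recursion only fixes which physical PE handles which pair.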

We now analyze the complexity of the decoder. Let $P(n)$ be the number of processors of the $N = 2^n$
decoder. Then $P(n) = 2^{n-2} + P(n-1)$ and $P(1) = 1$, so $P(n) = 2^{n-1} = N/2$. The number of registers
that we use in the design for the llrs (not including registers for the input and the encoding registers) is
$R(n) = 2^{n-1} + R(n-1)$, $R(1) = 1$, so we have $R(n) = 2^n - 1 = N - 1$. The number of multiplexers for
the llrs, denoted by $M(n)$, satisfies $M(n) = 2^{n-1} + M(n-1)$, $M(1) = 0$, so $M(n) = N - 2$.
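As a quick numerical check of these three recursions (hypothetical function name):

```python
def line_costs(n):
    """PEs P(n), llr registers R(n) and llr multiplexers M(n) of the line decoder, N = 2**n."""
    if n == 1:
        return 1, 1, 0
    p, r, m = line_costs(n - 1)
    return 2 ** (n - 2) + p, 2 ** (n - 1) + r, 2 ** (n - 1) + m
```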

We want to make a remark about the efficiency of the design we propose here. The recursive design has
the potential advantage of being a clearer reflection of the underlying algorithm. It also has the potential
advantage of emphasizing the parts of the system that may be reused. However, it may have a disadvantage
when considering the routing of signals in the circuit. Because we want to use the decoder of the $N/2$
length polar code as a closed box, we route all the signals to and from it through its interface. This may
result in some signals traversing a long path before reaching their target processor. These paths may be
too long for the circuit to have a good clock frequency, thereby degrading the achievable
throughput. It is therefore advised to optimize the circuit by "opening" the recursive units and making
the paths shorter, after completing the design of the circuit in a recursive manner. It is also a good
idea, when building a decoder for a $2N$ length code, to use this "optimized" design of
the $N$ length decoder in the $2N$ length design, enjoying the benefits of the recursion. We give here two
examples of such long-path hazards that we believe are likely to pose a problem, along with their
possible solutions.

1. The multiplexer layer at the input of the embedded line decoder of the length $N/2$ code is required
because of the introduction of the P-Mode. A closer look at our design reveals that some of the
signals have long paths before reaching their target PE. For example, the inputs $\lambda(0)$ and $\lambda(1)$ need to
traverse $\log_2(N) - 1$ multiplexer layers before reaching their processor. Since the P-Mode needs to
be accomplished in one time unit, this long path may be prohibitive. By "opening" the $N/2$ length
decoder box, the designer is able to control the lengths of the paths by proper routing.

2. The "Encoding Layer" also suffers from long routing. We assumed in our analysis that the encoding
procedure is combinatorial, and can therefore be done within a clock cycle. This may be a problem
when several encoding circuits are operated one after the other. This is, for example, the case of
step IV of the decoder of the length $N/2^i$ code, which occurs within step IV of the decoder of the length
$N/2^{i-1}$ code for $1 \le i \le \log_2 N - 2$. In this case, $O(\log N)$ operations need to occur sequentially
in one clock cycle. For large $N$ and a high clock frequency, this may not be feasible.
The idea of Leroux et al. [15] was to use flip-flops for saving the partial encoding of each code bit
in the different layers of the decoding circuit. Each such flip-flop is connected through a xor circuit to
the signal line of the estimated information bit. As such, whenever the SC decoder decides on an
information bit, the flip-flops corresponding to the code bits that depend on this information
bit are updated accordingly. These flip-flops need to be reset whenever we start decoding their
corresponding outer code. For example, when we start using the embedded $N/2$ length decoder (in
step II and step IV), its partial-encoding flip-flops need to be erased (as they correspond to a new
outer code).

It should be noted that this idea may also be described recursively, by changing the specification of
the length $N$ polar code decoder in S-Mode, and requiring it to output the estimated information
bits as soon as they are ready. The decoder should also have an $N$ length binary indicator vector
that indicates which code bits depend on the currently estimated information bit. It is easy to
see that using the indicator vector of the length $N/2$ decoder, it is possible to calculate the $N$ length
indicator vector by using the $(u+v, v)$ mapping. This, however, again generates a computation
path of length $O(\log N)$. This problem can be addressed by having a fixed indicator circuit for each
partially-encoded-bit flip-flop. This circuit indicates which information bit should be accumulated,
depending on the ordinal number of this bit. For example, for the decoder of the code of length
$N$, we should have an array of $N/2$ flip-flops, each one corresponding to a bit of the codeword of
the $N/2$ length first outer code. Each of these flip-flops should have an indicator circuit that
gets as input the value of a counter signaling the ordinal number of the information bit that has
been estimated, and returns 1 iff its corresponding codeword bit is influenced by this information
bit. For example, the indicator circuit corresponding to the first code bit is a constant 1, because
$x_0 = \sum_{i=0}^{N/2-1} u_i$, i.e. it depends on all the information bits. On the other hand, the last bit's
indicator (i.e. that of $x_{N/2-1}$) returns 1 iff its input equals $N/2 - 1$, because $x_{N/2-1} = u_{N/2-1}$. Using
the global counter (which is advanced whenever an information bit is estimated) and the indicator
circuits, each code bit that is influenced by this information bit adds it to its flip-flop.

Using the Kronecker power form of the generating matrix of the $(u+v, v)$ polar code, it can be seen
that each such indicator circuit can be designed using no more than $O(\log n) = O(\log\log N)$
AND and NOT circuits; therefore the total cost of these circuits is $O(N \log\log N)$ in terms
of space complexity.
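For the interleaving used in Steps I-IV of Subsection 4.1 ($x_{even}$ = first outer codeword + second, $x_{odd}$ = second), one can verify the following characterization: code bit $j$ of a length $2^m$ code depends on $u_i$ exactly when every bit that is set in the $m$-bit reversal of $j$ is also set in $i$. The indicator is therefore an AND over the counter bits selected by $j$. The sketch below presents this characterization as our own derivation, not as the construction of [15], and cross-checks it against a reference encoder. All names are hypothetical.

```python
def bit_reverse(j, n_bits):
    """Reverse the n_bits-bit binary representation of j."""
    r = 0
    for _ in range(n_bits):
        r = (r << 1) | (j & 1)
        j >>= 1
    return r

def indicator(j, i, n_bits):
    """Hypothetical indicator circuit: 1 iff code bit j of a length-2**n_bits
    outer code depends on information bit i (for the interleaving
    x[2k] = a[k] + b[k], x[2k+1] = b[k])."""
    r = bit_reverse(j, n_bits)       # constant per flip-flop
    return int((i & r) == r)         # AND of the counter bits selected by r

def encode(u):
    """Reference (u+v, v) encoder with the same interleaving, for cross-checking."""
    if len(u) == 1:
        return list(u)
    a, b = encode(u[:len(u) // 2]), encode(u[len(u) // 2:])
    return [bit for k in range(len(a)) for bit in (a[k] ^ b[k], b[k])]
```

The two special cases in the text follow directly: for $j = 0$ the reversal is 0, so the indicator is the constant 1, and for the last bit the reversal is all ones, so only the last counter value is accepted.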

In summary, the recursive architecture may be developed and modified to meet the timing requirements
of the circuit. This may be done by "opening the box" of the embedded decoders, and even altering
them to support more efficient designs.

A careful examination of the line decoder reveals that the auxiliary array is only used in steps I and
III, and is idle in the other steps. This might motivate us to consider two variations on this design. The
first one adds hardware and uses these arrays to increase the throughput, while the second one decreases
the throughput and thereby reduces the required hardware.

4.2.1 Parallel Decoding of Multiple Codewords 

There are cases in which it is required to increase the throughput of the decoder by allowing parallel decoding
of multiple codewords. A simple solution is to introduce $p$ decoders when there is a need to decode $p$
codewords simultaneously. Because the auxiliary array of processors is idle most of the time, it seems
a good idea to "share" this array among several decoders. By appropriately scheduling the commands
to the processors, it is possible to implement a decoder for $p$ parallel codewords that is
less expensive than simply duplicating the decoders (the naive solution).

Since the array is idle during steps II and IV, in which the decoder of the length N/2 code is active,
it is possible to have p ≤ T(n − 1) + 1 = N − 1 decoders sharing the same auxiliary array, where
T(n) = 2N − 2 is the running time of the standard line decoder. The decoding of each codeword is
started one clock cycle after the previous one. Assuming that p = N − 1, we obtain a decoding time of
T(n) + N − 2 = 3N − 4 for N − 1 codewords, while using p · P(n − 1) + N/4 = (N − 1) · N/4 + N/4 = N²/4
processors, where P(n − 1) = N/4 is the number of processors of the embedded length N/2 decoder. This is
about half of the (N − 1) · N/2 processors required by the naive solution.
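The processor counts and the decoding time above can be checked with a few lines of Python. This is a minimal sketch that assumes a standard line decoder of length N has N/2 PEs, split into N/4 PEs inside the embedded length N/2 decoder and an auxiliary array of N/4 PEs. The function names are hypothetical.

```python
def pe_counts(N: int) -> tuple[int, int]:
    """Return (naive, shared) PE counts for p = N - 1 codewords decoded in parallel.

    Assumption: a standard length-N line decoder has N/2 PEs, namely N/4 in the
    embedded length-N/2 decoder plus an auxiliary array of N/4 PEs.
    """
    assert N >= 4 and (N & (N - 1)) == 0, "N must be a power of two, at least 4"
    p = N - 1
    naive = p * (N // 2)            # p complete copies of the line decoder
    shared = p * (N // 4) + N // 4  # p embedded decoders + one shared auxiliary array
    return naive, shared


def parallel_decoding_time(N: int) -> int:
    """Cycles until all N - 1 codewords finish: T(n) + N - 2 with T(n) = 2N - 2."""
    return (2 * N - 2) + (N - 2)


naive, shared = pe_counts(1024)
print(naive, shared, shared / naive)  # the ratio is close to 1/2
print(parallel_decoding_time(1024))   # equals 3N - 4
```

Under these assumptions the shared design needs exactly N²/4 PEs, so the saving approaches one half as N grows.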

This notion can be developed further. The decoder of the length N/2 code that is embedded in
the N length decoder has an auxiliary array of N/8 processors. This auxiliary array is used in
steps I and III of the decoders of length N and length N/2. It is therefore idle most of the time, and we
can share it among the p decoders of length N/2. Assuming that p = N − 1, we may allocate 3 auxiliary
arrays that are shared among the decoders, each dedicated to one of the following steps: one
array for step I (and III) of the N length decoder, one array for step I of the N/2 length decoder and
one array for step III of the N/2 length decoder. For each of the decoded codewords, the number of clock
cycles between these steps is at least p. Therefore, there is no contention on these resources, and the
throughput does not suffer from this hardware reduction.

In general, for p = N − 1, the auxiliary array within the embedded polar decoder of length N/2^i
(i ∈ [1, log₂(N) − 2]) can be shared among the p decoders. This requires that we allocate an instance of the
array for each decoding step in which it is used during the first half of the decoding algorithm for the
length N code (i.e. during steps I and II).

For this specific array, the calls are as follows:

- 1 call in step I of the N length decoder;
- 1 call for step I and 1 call for step III of the N/2 length decoder;
- 2 calls for step I and 2 calls for step III of the N/4 length decoder;
- in general, 2^l calls for step I and 2^l calls for step III of the length N/2^(l+1) decoder (0 ≤ l ≤ i − 1).

In summary, we need Σ_{l=0}^{i} 2^l = 2^(i+1) − 1 auxiliary arrays of processors, each of which
contains N/2^(i+2) PEs. In particular, we need N − 1 PEs for the length 2 decoder (each PE is allocated to a
specific decoder), and

$$\frac{N}{4}\sum_{i=0}^{\log_2(N)-2}\frac{2^{i+1}-1}{2^{i}} \approx \frac{N}{2}\left(\log_2(N)-1\right)$$

PEs for the other decoder lengths. This adds up to approximately (N/2)(1 + log₂(N)) PEs. We conclude that this
solution increases the throughput by a multiplicative factor of about N, while the PE hardware increases
only by a factor of approximately log₂(N). Note that the number of registers must increase by a
multiplicative factor of p.
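The counting argument above can be reproduced numerically. The helper below is hypothetical and rests on the following assumptions:

- a decoder of length M has an auxiliary array of M/4 PEs;
- 2^(i+1) − 1 copies of the auxiliary array of the length N/2^i decoder are needed, for i = 0, ..., log₂(N) − 2;
- one dedicated PE per decoder is needed at length 2.

Under these assumptions the exact total is (N/2) · log₂(N). This agrees with the approximate count in the text up to a lower-order term of N/2, so the growth factor over the N/2 PEs of a single decoder is still about log₂(N).

```python
import math


def shared_pe_total(N: int) -> int:
    """Total PEs when p = N - 1 decoders share their auxiliary arrays (sketch)."""
    n = int(math.log2(N))
    assert N >= 2 and 2**n == N, "N must be a power of two"
    total = N - 1  # length-2 level: one dedicated PE per decoder
    for i in range(n - 1):                # auxiliary array of the length N/2^i decoder
        copies = 2 ** (i + 1) - 1         # one copy per step that uses this array
        pes_per_copy = N // 2 ** (i + 2)  # a length-M decoder has an M/4-PE auxiliary array
        total += copies * pes_per_copy
    return total


for n in range(2, 11):
    N = 2**n
    # columns: N, shared total, (N/2) * log2(N), PEs of one standard decoder
    print(N, shared_pe_total(N), N // 2 * n, N // 2)
```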

A closer look at the above hardware design reveals that we actually allocated a different array of
processors for each sub-step of steps I and II of the N length decoder. The decoding operation of the p
codewords passes through these units in sequential order. However, each decoder should have its own
set of registers saving the state of the decoding algorithm. Another observation is that when we finish
decoding the first codeword (i.e. the one we started decoding at time 0), we can start decoding codeword
number N in the next time slot (and then codeword number N + 1, etc.), in a pipelined fashion. It should
be noted that Leroux et al. considered a similar idea and referred to it as the vector-overlapping structure.

4.2.2 Limited Parallelism Decoding 

Another approach for addressing the problem of low utilization of the auxiliary arrays is to limit the
number of processing elements that may operate simultaneously. This is a very practical
consideration, because a system designer typically faces a parallelism limitation due to power
consumption and silicon area. The limited parallelism inevitably increases the decoding time, and
thereby decreases the throughput. The line decoder of the code of length N has a PE parallelism of
N/2, because it may simultaneously compute at most N/2 llrs using its N/2 PEs.

We consider a line decoder for a length N code with limited parallelism of N/2^i, where i ∈ [1, log₂N].
This means that the decoder has exactly N/2^i PEs. If i = 1, then the decoder is the standard
line decoder. If i > 1, then the decoder's block diagram is similar to the one shown in Figure 5, with
the following changes.

• There is no auxiliary PE array.

• The embedded line decoder of the N/2 length code is replaced by a limited parallelism line
decoder with a parallelism factor of N/2^i.

• The signals array L(0 : N/4 − 1) at the output of the embedded line decoder is also connected
to the registers array R(N/4 : N/2 − 1).

• The multiplexers array at the input of the N/2 length line decoder is extended to also include
the input array λ(N/2 : N − 1). This means that we need an array of 3 → 1 multiplexers
(instead of 2 → 1), in which the k-th multiplexer selects between the inputs λ(k), λ(k + N/2) and R(k).

• There is an additional array of multiplexers at the input of the line decoder for selecting between
x(0 : N/4 − 1) and x(N/4 : N/2 − 1), to support the use of both parts of the decided codeword.
Similarly, for the P-Mode, we need an array of multiplexers to select between the two halves
of the x_in(0 : N/2 − 1) array.

The S-Mode decoding algorithm has 4 steps as before; however, steps I and III are modified as
follows.



STEP I

Sequentially,

• STEP I-a: At the MUX array, at the input of the (limited parallelism) decoder of the
length N/2 polar code, set the control signal c_m = 0, which means that λ(0 : N/2 − 1)
is selected as an input to this unit.

Set c_u = 0 and use the N/2 length polar code decoder in P-Mode. Store the output
array of signals L(0 : N/4 − 1), corresponding to

$$L(k) = 2\tanh^{-1}\left(\tanh\left(\lambda(2k)/2\right)\cdot\tanh\left(\lambda(2k+1)/2\right)\right), \quad 0 \le k \le N/4 - 1,$$

in the registers array R(0 : N/4 − 1).

• STEP I-b: Set the control signal of the MUX array to c_m = 1, which means that
λ(N/2 : N − 1) is selected as an input to this unit. Set c_u = 0 and use the decoder
of the polar code of length N/2 in P-Mode. Store the output signals array

$$L(k - N/4) = 2\tanh^{-1}\left(\tanh\left(\lambda(2k)/2\right)\cdot\tanh\left(\lambda(2k+1)/2\right)\right), \quad N/4 \le k \le N/2 - 1,$$

in the registers array R(N/4 : N/2 − 1).
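A small software model can confirm that splitting step I into two passes fills the same N/2 registers as a single fully parallel pass would. This is a minimal Python sketch that makes two assumptions:

- the input llrs are held in a list `lam` of length N;
- each loop stands for one P-Mode call of the limited parallelism decoder.

`f_plus` and `step_one_two_passes` are hypothetical names.

```python
import math


def f_plus(a: float, b: float) -> float:
    """2 * atanh(tanh(a/2) * tanh(b/2)), the llr combination used in step I."""
    t = math.tanh(a / 2.0) * math.tanh(b / 2.0)
    # Clamp so that atanh stays finite when tanh saturates at +-1 in floating point.
    t = max(-1.0 + 1e-15, min(1.0 - 1e-15, t))
    return 2.0 * math.atanh(t)


def step_one_two_passes(lam: list[float]) -> list[float]:
    """Fill R(0 : N/2 - 1) in two P-Mode passes of N/4 llrs each (sketch)."""
    N = len(lam)
    R = [0.0] * (N // 2)
    # STEP I-a (c_m = 0): pairs from lambda(0 : N/2 - 1) -> R(0 : N/4 - 1)
    for k in range(N // 4):
        R[k] = f_plus(lam[2 * k], lam[2 * k + 1])
    # STEP I-b (c_m = 1): pairs from lambda(N/2 : N - 1) -> R(N/4 : N/2 - 1)
    for k in range(N // 4, N // 2):
        R[k] = f_plus(lam[2 * k], lam[2 * k + 1])
    return R


print(step_one_two_passes([0.5, -1.2, 2.0, 0.3, -0.7, 1.1, 0.9, -2.5]))
```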

STEP III

Sequentially,

• STEP III-a: At the MUX array, at the input of the (limited parallelism) decoder of the
length N/2 polar code, set the control signal c_m = 0, which means that λ(0 : N/2 − 1)
is selected as an input to this unit.

Set c_u = 1 and use the N/2 length polar code decoder in P-Mode. Store the output
array of signals L(0 : N/4 − 1), corresponding to

$$L(k) = (-1)^{x(k)}\cdot\lambda(2k) + \lambda(2k+1), \quad 0 \le k \le N/4 - 1,$$

in the registers array R(0 : N/4 − 1). Note that we use x(0 : N/4 − 1), the first
half of the output from step II, as an input to the N/2 length decoder.

• STEP III-b: Set the control signal of the MUX array to c_m = 1, which means that
λ(N/2 : N − 1) is selected as an input to this unit. Set c_u = 1 and use the decoder
of the polar code of length N/2 in P-Mode. Store the output signals array

$$L(k - N/4) = (-1)^{x(k)}\cdot\lambda(2k) + \lambda(2k+1), \quad N/4 \le k \le N/2 - 1,$$

in the registers array R(N/4 : N/2 − 1). Note that we now use x(N/4 : N/2 − 1), the
second half of the output from step II, as an input to the N/2 length decoder.
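The two passes of step III can be modeled in the same way. This sketch assumes that `x` holds the N/2 decided bits of the first outer codeword as integers 0 or 1. The helper name is hypothetical. Pass III-a reads only x(0 : N/4 − 1) and pass III-b reads only x(N/4 : N/2 − 1), mirroring the extra multiplexer layer on the x input.

```python
def step_three_two_passes(lam: list[float], x: list[int]) -> list[float]:
    """Fill R(0 : N/2 - 1) with (-1)^x(k) * lambda(2k) + lambda(2k+1) in two passes."""
    N = len(lam)
    assert len(x) == N // 2
    R = [0.0] * (N // 2)
    # STEP III-a (c_m = 0): first half of lambda and of x; (1 - 2x) equals (-1)^x for x in {0, 1}
    for k in range(N // 4):
        R[k] = (1 - 2 * x[k]) * lam[2 * k] + lam[2 * k + 1]
    # STEP III-b (c_m = 1): second half of lambda and of x
    for k in range(N // 4, N // 2):
        R[k] = (1 - 2 * x[k]) * lam[2 * k] + lam[2 * k + 1]
    return R


print(step_three_two_passes([1.0, 2.0, -3.0, 0.5, 4.0, -1.0, 0.25, 0.75], [0, 1, 1, 0]))
# [3.0, 3.5, -5.0, 1.0]
```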

The P-Mode operation of the decoder is also changed, and now consists of two steps.

STEP I

At the MUX array, at the input of the (limited parallelism) polar code decoder of length
N/2, set the control signal c_m = 0, which means that λ(0 : N/2 − 1) is selected as an input
to this unit. Set c_u = c_{u,in} and use the N/2 length polar code decoder in P-Mode. Store
the output signals array L(0 : N/4 − 1) in the registers array R(0 : N/4 − 1). If c_{u,in} = 1,
use the first half of the input signals array x_in (i.e. x_in(0 : N/4 − 1)) as an input to the
N/2 length decoder (otherwise this input is ignored).

STEP II

Set the control signal of the MUX array to c_m = 1, which means that λ(N/2 : N − 1)
is selected as an input to this unit. Set c_u = c_{u,in} and use the N/2 length polar code
decoder in P-Mode. Store the output signals array L(0 : N/4 − 1) in the registers array
R(N/4 : N/2 − 1). If c_{u,in} = 1, use the second half of the input signals array x_in (i.e.
x_in(N/4 : N/2 − 1)) as an input to the N/2 length decoder (otherwise this input is ignored).

The output of the decoder is the array of signals corresponding to the array of registers
R(0 : N/2 − 1).

Let us analyze the time complexity of this algorithm. We denote the S-Mode running time (in terms of
clock cycles) for a length N = 2^n polar code with limited parallelism of N/2^i = 2^(n−i) by T(n, n − i). We
note that T(n, n − 1) = T(n), where T(n) = 2N − 2 is the running time of the standard line decoder. The
recursion formula is

$$T(n, n-i) = 2\cdot T(n-1, n-i) + 4\cdot T_p(n-1, n-i), \qquad (23)$$

where T_p(n, m) is the running time of the N = 2^n length decoder with limited parallelism 2^m in P-Mode:

$$T_p(n, m) = \begin{cases} 1, & n - m \le 1, \\ 2\cdot T_p(n-1, m), & \text{otherwise}. \end{cases} \qquad (24)$$

Therefore,

$$T_p(n, m) = \begin{cases} 1, & n - m \le 1, \\ 2^{n-m-1}, & \text{otherwise}. \end{cases} \qquad (25)$$

It can be shown that

$$T(n, n-i) = 2\cdot N + (i-2)\cdot 2^{i}, \qquad i > 1. \qquad (26)$$

Equation (26) reveals the tradeoff between the number of PEs and the running time of the algorithm. For
example, decreasing the number of processors by a multiplicative factor of 8 compared to the standard
case (i.e. i = 4) increases the decoding time by only 34 clock cycles. We note, however,
that building such a decoder requires additional control hardware (e.g. multiplexer layers).
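Recursions (23) and (24) are easy to evaluate directly, which gives a quick numerical check of the closed forms and of the 34-cycle example. The Python sketch below assumes, as stated above, that the standard line decoder (i = 1) needs T(n) = 2N − 2 cycles. The function names are hypothetical.

```python
def t_standard(n: int) -> int:
    """Running time 2N - 2 of the standard line decoder, N = 2^n."""
    return 2 * 2**n - 2


def t_p(n: int, m: int) -> int:
    """P-Mode time of a length-2^n decoder with parallelism 2^m, recursion (24)."""
    return 1 if n - m <= 1 else 2 * t_p(n - 1, m)


def t_limited(n: int, m: int) -> int:
    """S-Mode time of a length-2^n decoder with parallelism 2^m, recursion (23)."""
    if n - m <= 1:  # parallelism N/2: this is the standard line decoder
        return t_standard(n)
    return 2 * t_limited(n - 1, m) + 4 * t_p(n - 1, m)


n = 10
print(t_limited(n, n - 4) - t_standard(n))  # i = 4, i.e. 8 times fewer PEs: 34
```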

It seems that for a limited list size, the Successive Cancellation List decoder may also be implemented
by a line decoder. This requires duplicating the hardware according to the list size, M, and introducing the
appropriate logic (i.e. comparators and multiplexer layers). It is possible to provide an implementation
with O(f(M) · N) time complexity, where f(·) is a polynomially bounded function that depends on
the efficiency of the algorithms for selecting the M most likely paths (in the N = 2 decoder). Furthermore, the
normalization of the likelihoods should be considered carefully, as it also affects the
precise (i.e. non-asymptotic) time complexity.



4.3 The BP Line Decoder 

As we saw in Subsection 3.3, BP is an iterative algorithm in which messages are sent on the normal factor
graph representing the code. In this subsection, we consider an implementation of the BP decoder with
the GCC serial schedule. The proposed decoder structure is a variation on the recursive structure of the
SC line decoder. Figure 7 depicts a block diagram for this design. The main changes with respect to the
SC decoder are in the memory resources and the processor structure.

The memory plays a fundamental role in the design, as it keeps computed messages available within
the iteration boundary and beyond it. The basic requirement is that each "butterfly" realization of the
(u + v, v) factor graph should have memory resources to store its messages. To keep messages
within the iteration boundary, it suffices to have one registers array for each length of outer
code and for each message type. However, the need to keep a message beyond the iteration boundary
requires a dedicated memory array for each instance of the outer code.

In the case of the (u + v, v) code and the GCC schedule, only messages of type μ_v^(in) need to be kept
beyond the iteration boundary. We suggest addressing this requirement in the following way. For the
decoder of length N, we associate a registers matrix μ_v^(in)(0 : #_r(N) − 1, 0 : N/2 − 1). Here, #_r(N) is
the number of realizations of factor graphs corresponding to outer codes of size N that exist in our code.
For the code of length N, there is only one factor graph of this size (i.e. the entire graph), and therefore
for this decoder #_r(N) = 1.

Consider now the N/2 length decoder that is embedded within the N length decoder. We see in
Figure 7 that this decoder has 2 · #_r(N) realizations, i.e. for the N length decoder we have
#_r(N/2) = 2. This is because there are two outer codes of length N/2 in the N length code. Therefore,
the memory matrix associated with it has two rows and N/4 columns. The first row is dedicated to
the first realization of the outer code and the second row is dedicated to the second realization. Within
this N/2 length decoder, there is an embedded N/4 length decoder with 2 · #_r(N/2) realizations, so in
this case #_r(N/4) = 4. As a result, it has a registers matrix with 4 rows and N/8 columns (each row is
dedicated to one of the 4 outer codes of length N/4 in this GCC scheme). This development continues
until we reach the embedded decoder of length 2, which, by induction, has #_r(2) = N/2 realizations for
the N length decoder, so it requires a registers matrix with N/2 rows and one column.

Figure 7: Block Diagram for the BP Line Decoder

Figure 8: The BP Processing Element (diagram labels: OP-MUX, BP_PE, OP-De-MUX; inputs a and b; output out; control signals c_opMux, c_bppe and c_opDeMux)
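The per-level register matrices described above can be tabulated mechanically. The helper below is hypothetical. It lists, for each embedded decoder length, the number of rows (#_r) and columns of its μ_v^(in) matrix, assuming that #_r doubles and the length halves at each level down to length 2. Each level then holds N/2 registers, so the total is (N/2) · log₂(N), in line with the Θ(N log N) memory figure for this design.

```python
import math


def bp_register_matrices(N: int) -> list[tuple[int, int, int]]:
    """Return (decoder length, rows = #_r, columns) of each mu_v^(in) matrix (sketch)."""
    layout = []
    length, rows = N, 1          # the top-level decoder has a single realization
    while length >= 2:
        layout.append((length, rows, length // 2))
        length //= 2             # the embedded decoder handles codes half as long...
        rows *= 2                # ...but twice as many realizations of them
    return layout


N = 16
for length, rows, cols in bp_register_matrices(N):
    print(f"length {length:2d}: {rows} x {cols}")
print(sum(r * c for _, r, c in bp_register_matrices(N)), N // 2 * int(math.log2(N)))
```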

For correct operation of the decoder, the embedded decoders must be informed which
realization of the outer code's factor graph they are currently working on. This is the role of the signals
realizationID_{N/2}, realizationID_N and OuterCodeID, which indicate the specific realization as follows.
The signal realizationID_N tells the decoder of length N which realization of the factor graph of
the code of length N it is working on. Because we describe here a decoder for a code of length
N, there is only one realization of this graph, and therefore this signal is fixed to 0. However, if this were an
embedded decoding unit within a decoder for a longer code, then this signal would indicate the ordinal
number of the outer code being decoded, ranging from 0 to #_r(N) − 1. The signal realizationID_{N/2}
identifies the realization of the outer polar code of length N/2. It is computed as
2 · realizationID_N + OuterCodeID, where OuterCodeID equals 0 on step II and 1 on step IV of
the iteration of the N length decoder.
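Walking the recursion in software shows that the rule realizationID_{N/2} = 2 · realizationID_N + OuterCodeID addresses every row of the length 2 register matrix exactly once per iteration. This is a sketch under three assumptions:

- each S-Mode call decodes outer code 0 in step II and outer code 1 in step IV;
- the recursion stops at length 2;
- the top-level realizationID is 0.

The function name is hypothetical.

```python
def leaf_realization_order(N: int) -> list[int]:
    """Order of realizationID values seen by the length-2 decoder in one S-Mode pass."""
    order = []

    def s_mode(length: int, realization_id: int) -> None:
        if length == 2:
            order.append(realization_id)
            return
        for outer_code_id in (0, 1):  # OuterCodeID = 0 in step II, 1 in step IV
            s_mode(length // 2, 2 * realization_id + outer_code_id)

    s_mode(N, 0)  # realizationID_N is fixed to 0 at the top level
    return order


print(leaf_realization_order(16))  # [0, 1, 2, 3, 4, 5, 6, 7]
```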

We also need registers arrays for the messages of types μ_{e1→a0}, μ_{a0→e1}, μ_u and μ_v,
each of length N/2. We denote them by μ_{e1→a0}(0 : N/2 − 1), μ_{a0→e1}(0 : N/2 − 1), μ_u^(in)(0 :
N/2 − 1), μ_u^(out)(0 : N/2 − 1) and μ_v^(out)(0 : N/2 − 1). Note that, as opposed to the memory structure for
the μ_v^(in) messages, these arrays do not need to be available beyond the iteration boundary; therefore it
suffices to have them as arrays and not matrices. Furthermore, the arrays for messages μ_{e1→a0}, μ_u^(out) and
μ_v^(out) can be replaced by one temporary array. However, in the description of the hardware structure,
we chose not to do this, in order to keep the discussion more comprehensible.

Figure 8 depicts the processing element BP_PE that is considered here. This unit has two inputs
for message llrs, and depending on the control signal c_bppe it performs either the f^(+)(·, ·) function or
the f^(=)(·, ·) function. Because it has to implement the functionalities of equations (13)-(18), we introduce routing
layers for the inputs (OP-MUX) and the outputs (OP-De-MUX). These ensure that the proper inputs are
given to the processor and that its output is stored in the appropriate array, depending on the computation
schedule of the iteration.
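A behavioral model of BP_PE is useful when writing test vectors for the hardware. The sketch below assumes the standard llr forms of the two node functions:

- f^(+)(a, b) = 2 tanh^-1(tanh(a/2) tanh(b/2)) for the parity-check node;
- f^(=)(a, b) = a + b for the equality node.

The 0/1 encoding of c_bppe and the function names are hypothetical. The routing performed by OP-MUX and OP-De-MUX is not modeled.

```python
import math

F_PLUS, F_EQUAL = 0, 1  # hypothetical encoding of the control signal c_bppe


def bp_pe(a: float, b: float, c_bppe: int) -> float:
    """Combine two message llrs according to the selected node function (sketch)."""
    if c_bppe == F_PLUS:
        t = math.tanh(a / 2.0) * math.tanh(b / 2.0)
        t = max(-1.0 + 1e-15, min(1.0 - 1e-15, t))  # keep atanh finite
        return 2.0 * math.atanh(t)
    if c_bppe == F_EQUAL:
        return a + b
    raise ValueError("unknown c_bppe value")


print(bp_pe(1.0, 2.0, F_EQUAL))  # 3.0
print(bp_pe(1.0, -2.0, F_PLUS))  # negative: one negative input flips the sign
```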

Besides the messages that serve as inputs or outputs to the processor, we allocate two additional
message inputs, denoted by ext_a and ext_b, and one additional output message, denoted by ext_out. These
inputs and output are used during the P-Mode of the decoder. In Figure 7 we preferred,
for brevity, not to specify these routing units for each processor, but rather to group them into routing
arrays. The inputs and outputs of these routing arrays are arrays of inputs and outputs corresponding to
the types of inputs and outputs that appear in Figure 8. The convention is that in these routing arrays,
the i-th output corresponds to the i-th input from each signals array (the signals array is selected by the
control signal of the routing array). Moreover, the i-th output of the OP-MUX array corresponds to the
i-th consecutive processor from the array of processors it serves. Similarly, the i-th input of the OP-De-MUX
array corresponds to the i-th consecutive processor from the array of processors it serves.

As in the SC case, the BP line decoder has two operation modes.

• S-Mode - The decoder gets as input μ^(in) type messages referring to its inputs. It outputs μ^(out)
type messages corresponding to its inputs (i.e. messages that are sent from the subgraph whose
realization the decoder is operating on to its neighbors) and an estimation of the information word
vector (denoted by infoEst).

• P-Mode - The decoder serves as an array of N/2 processors and simultaneously performs the
computation of the type of message indicated by c_{bppe,external}, using the signals ext_a and ext_b as the inputs
and ext_out as the output.

The S-Mode decoding algorithm operates as follows.

STEP I

Simultaneously,

• At the MUX array, at the input of the decoder of the code of length N/2, set the
control signal c_m = 0, which means that the OP-MUX array is selected as the input
to the decoder. Set c_opMux such that μ_v^(in)(0 : N/4 − 1) and μ_x1^(in)(0 : N/4 − 1)
are selected as the first input and the second input of this unit, respectively.
Set c_{bppe,internal} to correspond to the computation of (13) and use the N/2 length
polar code decoder in P-Mode. Set the OP-De-MUX array to direct the output to
μ_{e1→a0}(0 : N/4 − 1).

• Keeping the same values for c_opMux and c_{bppe,internal}, use the auxiliary array of
processors to operate on the inputs μ_v^(in)(N/4 : N/2 − 1) and μ_x1^(in)(N/4 : N/2 − 1)
and direct the output to μ_{e1→a0}(N/4 : N/2 − 1).

Simultaneously,

• Keep c_m = 0. Set c_opMux such that μ_x0^(in)(0 : N/4 − 1) and μ_{e1→a0}(0 : N/4 − 1) are
selected as the first input and the second input to the N/2 length decoder, respectively.
Set c_{bppe,internal} to correspond to the computation of (15) and use the
decoder of length N/2 in P-Mode. Set the OP-De-MUX array to direct the output
to μ_u^(out)(0 : N/4 − 1).

• Keeping the same values for c_opMux and c_{bppe,internal}, use the auxiliary array of
processors to operate on the inputs μ_x0^(in)(N/4 : N/2 − 1) and μ_{e1→a0}(N/4 : N/2 − 1)
and direct the output to μ_u^(out)(N/4 : N/2 − 1).

STEP II

• At the MUX array, at the input of the BP decoder of the N/2 length polar code, set
the control signal c_m = 1, which means that the input from the second multiplexer
is selected as the input to this unit. Specifically, since OuterCodeID = 0,
μ_u^(out)(0 : N/2 − 1) is the input to this decoder.

• Provide the indices of the frozen bits from the first half of the codeword to the N/2
length decoder, and operate it in S-Mode. Store the estimation of the information
word (output signals array infoEst) in the bits array u(0 : N/2 − 1). Direct the output
messages to be saved in μ_u^(in)(0 : N/2 − 1), using the de-mux that is connected to the
outMessages signals array at the output of the N/2 length decoder.

STEP III

Simultaneously,

• At the MUX array, at the input of the BP decoder of the N/2 length polar code, set
the control signal c_m = 0, which means that the OP-MUX array is selected as the
input to the decoder. Set c_opMux such that μ_x0^(in)(0 : N/4 − 1) and μ_u^(in)(0 : N/4 − 1)
are selected as the first input and the second input to this unit, respectively. Set
c_{bppe,internal} to correspond to the computation of (14), and use the N/2 length
decoder in P-Mode. Set the OP-De-MUX array to direct the output to the array
μ_{a0→e1}(0 : N/4 − 1).

• Keeping the same values for c_opMux and c_{bppe,internal}, use the auxiliary array of
processors to operate on the inputs μ_x0^(in)(N/4 : N/2 − 1) and μ_u^(in)(N/4 : N/2 − 1)
and direct the output to μ_{a0→e1}(N/4 : N/2 − 1).

Simultaneously,

• Keep c_m = 0 and change c_opMux such that μ_x1^(in)(0 : N/4 − 1) and μ_{a0→e1}(0 : N/4 − 1)
are the first input and the second input to the N/2 length decoder, respectively.
Set c_{bppe,internal} to correspond to the computation of (16) and use the N/2 length
decoder in P-Mode. Set the OP-De-MUX array to direct its output to the array
μ_v^(out)(0 : N/4 − 1).

• Keeping the same values for c_opMux and c_{bppe,internal}, use the auxiliary array of
processors to operate on the inputs μ_x1^(in)(N/4 : N/2 − 1) and μ_{a0→e1}(N/4 : N/2 − 1)
and direct the output to μ_v^(out)(N/4 : N/2 − 1).

STEP IV

• At the MUX array, at the input of the decoder of the code of length N/2, set the
control signal c_m = 1, which means that the input from the second multiplexer is
selected as the input to this unit. Also set OuterCodeID = 1, which means that
μ_v^(out)(0 : N/2 − 1) is the input to this decoder.

• Provide the indices of the frozen bits from the second half of the codeword to the
N/2 length decoder, and operate it in S-Mode to decode the second
outer polar code of length N/2. Save the estimation of the information word (output
signals array infoEst) in the bits array u(N/2 : N − 1). Direct the output messages to
be stored in μ_v^(in)(0 : N/2 − 1), using the de-mux that is connected to the outMessages
signals array at the output of the N/2 length decoder.

Simultaneously,

• At the MUX array, at the input of the decoder of the code of length N/2, set the
control signal c_m = 0. Set c_opMux such that μ_v^(in)(0 : N/4 − 1) and μ_x1^(in)(0 : N/4 − 1)
are selected as the first input and the second input of this unit, respectively.
Set c_{bppe,internal} to correspond to the computation of (13) and use the N/2 length
polar code decoder in P-Mode. Set the OP-De-MUX array to direct the output to
μ_{e1→a0}(0 : N/4 − 1).

• Keeping the same values for c_opMux and c_{bppe,internal}, use the auxiliary array of
processors to operate on the inputs μ_v^(in)(N/4 : N/2 − 1) and μ_x1^(in)(N/4 : N/2 − 1)
and direct the output to μ_{e1→a0}(N/4 : N/2 − 1).

Simultaneously,

• Keep c_m = 0 and change c_opMux such that μ_u^(in)(0 : N/4 − 1) and μ_{e1→a0}(0 : N/4 − 1)
are the first input and the second input to the N/2 length decoder, respectively.
Use the polar code decoder of length N/2 in P-Mode, and set c_{bppe,internal} to
correspond to the computation of (17). Set the OP-De-MUX array to direct the
output to μ_x0^(out)(0 : N/4 − 1).

• Keeping the same values for c_opMux and c_{bppe,internal}, use the auxiliary array of
processors to operate on the inputs μ_u^(in)(N/4 : N/2 − 1) and μ_{e1→a0}(N/4 : N/2 − 1)
and direct the output to μ_x0^(out)(N/4 : N/2 − 1).

Simultaneously,

• Keep c_m = 0 and change c_opMux such that μ_v^(in)(0 : N/4 − 1) and μ_{a0→e1}(0 : N/4 − 1)
are the first input and the second input to the N/2 length decoder, respectively.
Set c_{bppe,internal} to correspond to the computation of (18) and use the N/2 length
polar code decoder in P-Mode. Set the OP-De-MUX array to direct the output to
μ_x1^(out)(0 : N/4 − 1).

• Keeping the same values for c_opMux and c_{bppe,internal}, use the auxiliary array of
processors to operate on the inputs μ_v^(in)(N/4 : N/2 − 1) and μ_{a0→e1}(N/4 : N/2 − 1)
and direct the output to μ_x1^(out)(N/4 : N/2 − 1).

In S-Mode, the output of the decoder is the array u(0 : N − 1) together with the two arrays μ_x0^(out)(0 : N/2 − 1)
and μ_x1^(out)(0 : N/2 − 1), interleaved. This means that the output message vector is an array in which
the entries with even indices are taken from μ_x0^(out)(0 : N/2 − 1) and the entries with odd indices are taken from
μ_x1^(out)(0 : N/2 − 1).
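The interleaving of the two output message arrays is a simple index mapping. The minimal Python sketch below uses hypothetical argument names standing for the output messages on the x0 and x1 sides.

```python
def interleave_outputs(mu_x0: list[float], mu_x1: list[float]) -> list[float]:
    """Even entries come from mu_x0, odd entries from mu_x1."""
    assert len(mu_x0) == len(mu_x1)
    out = [0.0] * (2 * len(mu_x0))
    out[0::2] = mu_x0  # indices 0, 2, 4, ...
    out[1::2] = mu_x1  # indices 1, 3, 5, ...
    return out


print(interleave_outputs([0.1, 0.2], [-1.0, -2.0]))  # [0.1, -1.0, 0.2, -2.0]
```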

In P-Mode, the decoder serves as an array of N/2 processors that operate in parallel; the control
signal c_{bppe,external} indicates which operation is performed by all the processors. The inputs to the
processors are the signals arrays ext_a(0 : N/2 − 1) (the first input) and ext_b(0 : N/2 − 1) (the
second input). The output is directed to the signals array ext_out(0 : N/2 − 1). The P-Mode decoding
algorithm operates as follows.

Simultaneously,

• At the MUX array, at the input of the BP decoder of the polar code of length N/2,
set the control signal c_m = 0, which means that the OP-MUX array is the input of
the decoder. Set c_opMux such that ext_a(0 : N/4 − 1) and ext_b(0 : N/4 − 1) are
the first input and the second input, respectively. Use the polar code decoder of
length N/2 in P-Mode, and set c_{bppe,internal} = c_{bppe,external}. Set
the OP-De-MUX array to direct the output to ext_out(0 : N/4 − 1).

• Keeping the same values for c_opMux and c_{bppe,internal}, use the auxiliary array of
processors to operate on the inputs ext_a(N/4 : N/2 − 1) and ext_b(N/4 : N/2 − 1),
and direct the output to ext_out(N/4 : N/2 − 1).

Let us now consider the time complexity (in terms of the number of clock cycles consumed by an
iteration) of this design. As before, let T(n) be the time complexity of the decoder of the polar code of
length N = 2^n. We assume that each call to a PE requires one clock cycle. In our design, we therefore
have

$$T(n) = 2\cdot T(n-1) + 7, \qquad (27)$$

and T(1) = 4, so T(n) = 5.5 · N − 7 = Θ(N). The memory consumption, however, is Θ(N · log N),
because of the memory matrices for the μ_v^(in) type of messages. The number of processing elements in
this design is N/2.

It should be noted that the suggested processor can be further improved to allow
some operations to occur in parallel:

- If the PE could run one f^(+)(·) operation and one f^(=)(·) operation in parallel, the last two
  operations in step IV could be done in one clock cycle, reducing the free addend in (27) to 6.
- A further reduction would be achieved if one could perform f^(+)(·) and direct its output to
  f^(=)(·) in one clock cycle. This would merge the two operations of step III into one operation.
- Allowing the computation of f^(=)(·) and directing its output to f^(+)(·) in the same clock cycle
  would consolidate the two operations of step I into one operation. This change may also allow
  consolidating the second and third computations in step IV, making the first change redundant.

These changes result in a free addend of 4 in (27) and T(1) = 2, so
T(n) = 3 · N − 4. Naturally, these changes require the appropriate amendments in the routing units
described before.
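Both closed forms follow from unrolling T(n) = 2 · T(n − 1) + c. The short Python check below iterates the recursion from T(1) for the two cases discussed: c = 7 with T(1) = 4, and c = 4 with T(1) = 2. It is only a sanity check of the arithmetic, and the function name is hypothetical.

```python
def bp_iteration_cycles(n: int, t1: int, c: int) -> int:
    """Iterate T(k) = 2 * T(k - 1) + c from T(1) = t1 up to k = n."""
    t = t1
    for _ in range(2, n + 1):
        t = 2 * t + c
    return t


for n in range(1, 6):
    N = 2**n
    # columns: N, cycles with the basic PE, cycles with the improved PE
    print(N, bp_iteration_cycles(n, 4, 7), bp_iteration_cycles(n, 2, 4))
```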

The remarks we made on the SC line decoder at the end of Subsection
4.2 also apply here. Specifically, the long-path hazards, which call for more efficient designs obtained by opening
the recursive boxes, are also relevant for the BP decoder, especially for the routing layers in P-Mode.
Furthermore, idle clock cycles of the PEs are also a problem in this design, and the solutions of
Subsections 4.2.1 and 4.2.2 may be adapted to this decoder too.

However, the cause of the inactive PEs differs. In the SC decoder, it stems from the properties of the
SC algorithm, which dictates the scheduling of the message computation. In the BP case, it stems from
the schedule we chose and is not a mandatory property of the algorithm. Other types of schedules exist,
and currently there is no evidence as to which schedule is better (for example, in terms of the achieved
error rate or the average number of iterations required for convergence). Hussami et al. [12] proposed
the Z-shape schedule, whose description suggests a constant level of parallelism of N PEs (of the type
we considered here) operating all the time. This seems to give the Z-shape schedule an advantage over
the GCC schedule if the number of processors is not limited (unless the technique of Subsection 4.2.1
is applied). Which schedule is better when the number of processors is limited is an interesting
question, and a matter for further research.



5 Hardware Architectures for General Kernels 

So far, we described algorithms for decoding polar codes in a recursive way. This notion has enabled us
to restate the hardware implementation of SC for Arikan's construction, which was proposed by Leroux
et al. [15]. In addition, we suggested an implementation of BP decoding for the GCC schedule. In this
section, we generalize these constructions to other types of kernels. Because we already
covered the implementation for Arikan's codes in some detail, we will be briefer in this section, mainly
emphasizing the principal differences from the designs in Section 4.

5.1 Recursive Description for the SC Line Decoder for General Kernels 

Figure 9 depicts a block diagram of an SC line decoder for a general linear kernel of dimension ℓ over
the alphabet F. This kernel has an ℓ × ℓ generating matrix G associated with it. We assume that this
decoder has the same requirements for the inputs and outputs as those given for the (u + v, v) line
decoder in Subsection 4.2.

Figure 9: Block Diagram for the SC Line Decoder for a General Linear Kernel of Dimension ℓ

The basic processing element of this design (denoted by PE) gets ℓ llr functions (each function consists of
|F| − 1 values) and the coset vector that reflects the decisions of the previous stages. The control signal c_u
indicates which type of llr function the processor should output. There are ℓ types of computations that
the processor should support, according to the different stages of decoding, as (2) implies. Since we consider
here a linear kernel, when decoding outer code number k, the assumption on u_0^{k−1} (the information
sub-vector input to the kernel) is manifested by the coset vector that this sub-vector induces. This coset
vector is generated by u_0^{k−1} · G_{↓(0:k−1)}, where G_{↓(0:k−1)} is the matrix containing only the first k rows of G.
This coset vector is gradually computed and maintained in the registers array x(0 : N − 1), as we explain
in the sequel. If the kernel is not linear, then each processor should get the previously
decided bits associated with it, i.e. the estimated sub-vector û_0^{k−1}, in order to perform (2).
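For a binary kernel, the coset vector u_0^{k−1} · G_{↓(0:k−1)} is the XOR of the rows of G selected by the decided bits. The sketch below assumes F = GF(2) and uses a 3 × 3 lower-triangular matrix purely as a hypothetical example kernel. For a non-binary alphabet the XOR would be replaced by field arithmetic.

```python
def coset_vector(u_prefix: list[int], G: list[list[int]]) -> list[int]:
    """Compute u_0^{k-1} * G restricted to its first k rows, over GF(2) (sketch)."""
    k, ell = len(u_prefix), len(G)
    assert k <= ell
    v = [0] * ell
    for r in range(k):          # only the first k rows of G take part
        if u_prefix[r]:
            v = [a ^ b for a, b in zip(v, G[r])]
    return v


G3 = [[1, 0, 0],
      [1, 1, 0],
      [1, 0, 1]]  # hypothetical example kernel
print(coset_vector([1, 1], G3))  # [0, 1, 0]
```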

How these llr computations are carried out is an important question that we do not elaborate on
here. For example, it may be beneficial to consider a trellis implementation of the decoding stages, or even
approximations of it, such as the min-sum rule [15], or near-ML decoding variants, such as
order statistics or box and match [25].

Since the outer codes in this design are of length N/ℓ, the processors in the preparatory steps of the
SC algorithm (i.e. steps 2·r − 1) should generate N/ℓ llr functions, serving as
inputs to the decoder of the outer code. Therefore, to have the maximum level of parallelism, we use
N/ℓ PEs in the decoder. The embedded N/ℓ length recursive decoder is able to contribute only N/ℓ²
processors, so the auxiliary array of processors needs to supply the rest of the processors, i.e. it should
have N/ℓ − N/ℓ² additional processors.
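The processor budget above can be checked with a few lines of arithmetic. The sketch below (function names are illustrative, not from the design) computes, for a code of length N = ℓ^n, the N/ℓ PEs needed at the top level, the N/ℓ² PEs contributed by the embedded decoder, and the size of the auxiliary array. Summing the auxiliary arrays over all recursion levels telescopes back to N/ℓ, i.e. the recursion needs no processors beyond the N/ℓ required at the top level.

```python
def pe_budget(N, ell):
    """PE counts for one recursion level of the line decoder (sketch).

    Returns (total PEs needed, PEs contributed by the embedded N/ell decoder,
    PEs in the auxiliary array). Assumes N is a power of ell.
    """
    total = N // ell
    embedded = N // ell**2  # 0 when N == ell: the length-1 "decoder" has no PE
    return total, embedded, total - embedded


def auxiliary_per_level(N, ell):
    """Auxiliary array sizes for lengths N, N/ell, ..., ell."""
    sizes = []
    while N >= ell:
        sizes.append(pe_budget(N, ell)[2])
        N //= ell
    return sizes


print(pe_budget(64, 4))            # (16, 4, 12)
print(auxiliary_per_level(81, 3))  # [18, 6, 2, 1], which sums to 81 / 3 = 27
```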

The encoding unit gets the decisions on the codewords of the outer codes from the N/ℓ length decoder.
Using these decisions, it computes the estimated coset vectors of the inner codes. To support this, we use
the signal outerCodeID, which identifies the outer code that is currently decoded. At the end of step 2·r, we
have outerCodeID = r − 1, because we just finished decoding outer code number r − 1 by the N/ℓ length
decoder. This decoder outputs the estimation of the codeword using the signals vector x̂(0 : N/ℓ − 1).
Now, the encoding layer performs the following operation, for 0 ≤ i ≤ N/ℓ − 1,

$$x(\ell\cdot i : \ell\cdot(i+1)-1) = x(\ell\cdot i : \ell\cdot(i+1)-1) + \hat{x}(i)\cdot G_{r-1}. \qquad (28)$$

This means that we add row number r − 1 of G, multiplied by the symbols of the recently estimated
outer codeword, to the previously estimated coset vectors (note that we have N/ℓ coset vectors, such
that x(ℓ·i : ℓ·(i + 1) − 1) corresponds to the i-th inner code, 0 ≤ i ≤ N/ℓ − 1). At the end of step 2·ℓ,
the output of the encoding layer is the estimation of the codeword.
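For the binary case (F = GF(2), so addition is XOR and each outer symbol is 0 or 1), Eq. (28) can be sketched as follows. The helper names, the example 3 × 3 matrix, and the outer codewords are assumptions chosen only for illustration. The check at the end confirms that after all ℓ updates each inner block equals the ℓ outer-code symbols at that position multiplied by G, which is the GCC relation that the encoding layer maintains.

```python
def encoding_layer_update(x, x_outer, G, r):
    """Eq. (28), binary case: add row r of G to every inner block i with x_outer[i] = 1.

    x       -- coset registers x(0 : N-1), updated in place
    x_outer -- estimated codeword of the outer code just decoded (length N/ell)
    r       -- 0-based index of that outer code (the text's r - 1)
    """
    ell = len(G)
    for i, symbol in enumerate(x_outer):
        if symbol:
            for j in range(ell):
                x[ell * i + j] ^= G[r][j]
    return x


def kernel_multiply(v, G):
    """Row vector v times binary matrix G over GF(2)."""
    out = [0] * len(G[0])
    for bit, row in zip(v, G):
        if bit:
            out = [a ^ b for a, b in zip(out, row)]
    return out


G = [[1, 0, 0], [1, 1, 0], [1, 0, 1]]      # example binary kernel (assumption)
outer = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]  # hypothetical outer codewords, N/ell = 3
x = [0] * 9
for r in range(len(G)):                    # one update at the end of each step 2*r
    encoding_layer_update(x, outer[r], G, r)
for i in range(3):
    assert x[3 * i:3 * i + 3] == kernel_multiply([outer[r][i] for r in range(3)], G)
print(x)  # [0, 0, 1, 0, 1, 1, 0, 1, 0]
```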

As in the (u + v, v) line decoder, we have two operation modes. The first one is S-Mode, in which
the decoder gets llr functions and the indices of the frozen symbols, and outputs the hard decisions on
the information word and its corresponding codeword. The second one is P-Mode, in which the decoder
operates as an array of processors and performs the same type of operation, according to the signal c_u.

In S-Mode, we have ℓ pairs of computation steps, as described below (1 ≤ r ≤ ℓ).



STEP 2·r − 1

Simultaneously, 

• At the MUX array at the input of the decoder of the polar code of length N/ℓ, set
the control signal c_m = 0, which means that the array λ(0 : N/ℓ − 1) is selected as
an input to this unit. Set c_u = r − 1 and supply the coset vectors x(0 : N/ℓ − 1)
to the unit (the latter is achieved because mode_in = 0). Use the decoder of the
polar code of length N/ℓ in P-Mode. This means that the processors will perform
the computation of the llrs of type r − 1, where k = r. The values
of the computations are stored in the registers array R(0 : N/ℓ² − 1).
• Use the auxiliary array of processors to perform the same computations on the
rest of the llrs array, λ(N/ℓ : N − 1), and the rest of the coset vectors, x(N/ℓ : N − 1).
The outputs of the computations are stored in the registers array R(N/ℓ² : N/ℓ − 1).

STEP 2·r

• At the MUX array at the input of the decoder of the polar code of length N/ℓ,
set the control signal c_m = 1, which means that the values of the registers array
R(0 : N/ℓ − 1) are the inputs to this unit.

• Provide the N/ℓ length polar code decoder with the indices of the frozen symbols from
the range [(r − 1)·N/ℓ, r·N/ℓ − 1]. Operate the N/ℓ length polar code decoder in S-Mode,
which results in decoding of outer code number r − 1. Store the estimated information
word as u((r − 1)·N/ℓ : r·N/ℓ − 1) = u(0 : N/ℓ − 1), where the right-hand side is the
information word output by the N/ℓ length decoder. Perform the computation of the
coset vectors as defined in (28).

If this is the last step (i.e. r = ℓ), then we give as output the content of u(0 : N − 1)
and the content of x(0 : N − 1). (To avoid the sampling delay due to the registers,
we prefer to give as output [u(0 : N − N/ℓ − 1) u(0 : N/ℓ − 1)] instead of u(0 : N − 1),
where the second part is taken directly from the output of the N/ℓ length decoder,
and the output of the encoding layer block instead of x(0 : N − 1).)
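The S-Mode schedule can be cross-checked against a behavioral (software) model. The sketch below, written for a binary linear kernel, mirrors the steps above: for each r it computes the N/ℓ llrs of type r − 1 from λ and the coset registers (step 2·r − 1), decodes outer code r − 1 with a recursive call standing in for the embedded N/ℓ length decoder (step 2·r), and applies the encoding-layer update (28). Function names, the llr convention log P(y|0)/P(y|1), the example kernel, and the frozen set are assumptions of this sketch; it models the data flow only, not timing, registers, or the MUX control.

```python
import itertools
import math
import random


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def kernel_llr(llrs, G, k, coset):
    """Exact llr of kernel input u_k (binary linear kernel, llr = log P(y|0)/P(y|1))."""
    ell = len(G)
    metrics = {0: [], 1: []}
    for uk in (0, 1):
        for tail in itertools.product((0, 1), repeat=ell - k - 1):
            x = list(coset)
            for bit, row in zip((uk,) + tail, G[k:]):
                if bit:
                    x = [a ^ b for a, b in zip(x, row)]
            metrics[uk].append(-sum(xj * lj for xj, lj in zip(x, llrs)))
    return log_sum_exp(metrics[0]) - log_sum_exp(metrics[1])


def add_row(x, x_outer, row):
    """Encoding layer, Eq. (28), binary case."""
    ell = len(row)
    for i, symbol in enumerate(x_outer):
        if symbol:
            for j in range(ell):
                x[ell * i + j] ^= row[j]


def encode(u, G):
    """GCC encoder with the same layout: inner code i occupies x[ell*i : ell*(i+1)]."""
    ell, N = len(G), len(u)
    if N == 1:
        return list(u)
    M = N // ell
    x = [0] * N
    for r in range(ell):
        add_row(x, encode(u[r * M:(r + 1) * M], G), G[r])
    return x


def sc_decode(llrs, frozen, G):
    """S-Mode of the length-N decoder; returns (u_hat, x_hat)."""
    ell, N = len(G), len(llrs)
    if N == 1:
        u = 0 if frozen[0] or llrs[0] >= 0 else 1
        return [u], [u]
    M = N // ell
    x = [0] * N                       # coset registers x(0 : N-1)
    u_hat = []
    for r in range(ell):              # 0-based r, i.e. the text's r - 1
        # Step 2r-1 (P-Mode): N/ell processors compute llrs of type r into R.
        R = [kernel_llr(llrs[ell * i:ell * (i + 1)], G, r, x[ell * i:ell * (i + 1)])
             for i in range(M)]
        # Step 2r (S-Mode): the embedded N/ell decoder decodes outer code r.
        u_part, x_outer = sc_decode(R, frozen[r * M:(r + 1) * M], G)
        u_hat += u_part
        add_row(x, x_outer, G[r])     # encoding layer update, Eq. (28)
    return u_hat, x


# Noiseless round trip with Arikan's kernel and a hypothetical frozen set.
G2 = [[1, 0], [1, 1]]
rng = random.Random(0)
frozen = [i < 8 for i in range(16)]
u = [0 if f else rng.randint(0, 1) for f in frozen]
x = encode(u, G2)
llrs = [20.0 if bit == 0 else -20.0 for bit in x]
assert sc_decode(llrs, frozen, G2) == (u, x)
```

Exhaustive marginalization keeps the model short and exact for small ℓ; a hardware-oriented model would replace `kernel_llr` with whatever approximation the PEs actually implement.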

The P-Mode operation is quite straightforward. The signal mode_in = 1 indicates that
the N length decoder operates in P-Mode. This causes the input coset vectors (denoted by the signals
array x_in(0 : N − 1)) to be routed to the processors (instead of the internal coset vectors x(0 : N − 1)). The
embedded N/ℓ length decoder operates in P-Mode as well (i.e. its mode_in signal equals 1). As a result, both the
auxiliary array of processors and the embedded N/ℓ length decoder compute the operation that is indicated by
the signal c_u,in, and output the computation results using the signals array L(0 : N/ℓ − 1).

The complexity analysis is also quite simple. As an example, if we assume that the processor requires
c clock cycles to complete the computation of each of its ℓ stages, then for a code of length N = ℓ^n we have
T(n) = ℓ·T(n − 1) + ℓ·c and T(1) = ℓ·c, so

$$T(n) = c\cdot\left(N + \frac{N-\ell}{\ell-1}\right)$$

clock cycles (for ℓ = 2 this gives c·(2N − 2)). The number of R registers for holding the llr functions
(each function contains |F| − 1 values) can be shown to be

$$N\sum_{i=1}^{n}\ell^{-i} = \frac{N-1}{\ell-1}.$$

The long routing path hazard that we raised in the context of the (u + v, v) decoder may also be of concern
here. Therefore, our suggestion to open the recursion boxes and to optimize them accordingly may be relevant
here as well. The ideas of sharing the auxiliary array of processors for increasing the throughput, or of
decreasing the parallelism, studied in Subsections 4.2.1 and 4.2.2 respectively, are also applicable here
with the obvious adaptations.
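The closed forms above can be checked numerically against their recursive definitions. The sketch below (function names are illustrative) evaluates the latency recurrence T(n) = ℓ·T(n − 1) + ℓ·c with T(1) = ℓ·c, and counts the R registers level by level (N/ℓ at the top, then N/ℓ², and so on down to the length-ℓ decoder), under the same assumption of c clock cycles per processor stage.

```python
def latency_recursive(n, ell, c):
    """T(1) = ell*c and T(n) = ell*T(n-1) + ell*c clock cycles."""
    if n == 1:
        return ell * c
    return ell * latency_recursive(n - 1, ell, c) + ell * c


def latency_closed_form(N, ell, c):
    """c * (N + (N - ell) / (ell - 1)); the division is exact for N = ell**n."""
    return c * (N + (N - ell) // (ell - 1))


def r_registers(N, ell):
    """R registers (llr functions) summed over all recursion levels."""
    return 0 if N == 1 else N // ell + r_registers(N // ell, ell)


for ell in (2, 3, 4):
    for n in range(1, 7):
        N = ell ** n
        assert latency_recursive(n, ell, c=1) == latency_closed_form(N, ell, c=1)
        assert r_registers(N, ell) == (N - 1) // (ell - 1)

print(latency_closed_form(1024, 2, 1))  # 2046, i.e. 2N - 2 for the (u + v, v) case
```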



5.2 About Decoders for Mixed Kernels and General Concatenated Codes 

So far, we considered decoders for homogeneous kernels that may be non-binary. These codes have the
nice property that the outer codes in their GCC structure are themselves polar codes from the same
family (but shorter ones). Therefore, we were able to use a single embedded decoder of a code of length
N/ℓ within the decoder of the code of length N. This embedded decoder is used ℓ times, each time with
different inputs (i.e. indices of the frozen symbols and the input messages). This property no longer
holds when mixed kernels are employed.

Consider, for example, the ℓ = 4 dimensional mixed kernel that we presented in one of our previous
papers [7]. In the decoder of the mixed code of length N = 4^n, we should have an embedded decoder of
the mixed code of length N/4, and an additional embedded decoder for the RS(4) polar code of length
N/4. Note, however, that even here hardware reuse is still possible, as the decoder
for the RS(4) code of length N/4 requires an embedded decoder for the RS(4) code of length N/16 within it. The
latter decoder (and its embedded decoders) can be shared with the decoder for the mixed code of length N/4
(which requires an embedded RS(4) decoder of the same length).
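The reuse argument can be illustrated by counting decoder units. The sketch below is a counting model only, built on the two facts stated in the text: a mixed-code decoder of length 4^m embeds a mixed decoder and an RS(4) decoder of length 4^(m−1), while an RS(4) decoder of length 4^m embeds a single RS(4) decoder of length 4^(m−1). Treating length-1 decoders as trivial, and sharing any two units with the same family and length, are additional assumptions made here for illustration; the function names are hypothetical.

```python
def shared_units(n):
    """Distinct (family, m) decoder units when equal units are shared."""
    needed = set()

    def visit(family, m):
        if m == 0 or (family, m) in needed:
            return
        needed.add((family, m))
        if family == "mixed":
            visit("mixed", m - 1)
        visit("RS4", m - 1)  # both families embed an RS(4) decoder of length 4**(m-1)

    visit("mixed", n)
    return needed


def unshared_units(family, m):
    """Decoder units if every embedded decoder is instantiated separately."""
    if m == 0:
        return 0
    count = 1 + unshared_units("RS4", m - 1)
    if family == "mixed":
        count += unshared_units("mixed", m - 1)
    return count


for n in range(1, 6):
    print(n, len(shared_units(n)), unshared_units("mixed", n))
```

Under these assumptions, sharing needs 2n − 1 units for a mixed code of length 4^n, whereas instantiating every embedded decoder separately needs n(n + 1)/2.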

A further generalization of this structure is the general concatenated structure, in which the
outer codes are not required to be polar codes. This means that other types of codes may be used with
their corresponding decoding algorithms. Examples of such structures using BCH codes and near-ML
decoding algorithms were recently described by Trifonov [5]. In these types of constructions, we need
a separate decoding unit for each outer code. As in the case of mixed kernels, if the outer codes
share structure and decoding algorithm, these resources may be reused, thereby enabling a more efficient
design.

Summary and Conclusions 

We considered the recursive GCC structures of polar codes, which led to a recursive description of their
decoding algorithms. Specifically, known algorithms (SC and SCL) were formalized in a recursive fashion,
and then were generalized for arbitrary kernels. The BP decoding algorithm with the GCC schedule
was also described. Then, recursive hardware architectures for these algorithms were considered. We
restated known architectures, and generalized them for arbitrary kernels.

In our discussion, we preferred, for brevity, to give somewhat abstract descriptions of the subjects,
emphasizing the main properties while neglecting some of the technical details. However, a complete
hardware design requires a full treatment of all of these details (as was done by Leroux et al. for the
(u + v, v) case [15]). We intend to verify this design for arbitrary kernels in future work.

Another issue that needs more careful attention is the BP decoder, and specifically the proposed
GCC schedule. A comparison between it and other proposed schedules (e.g. the Z-shaped schedule)
is an interesting question, which is also a subject for further research. The use of the BP decoder for
arbitrary kernels is another interesting problem that is worth further study. For these kernels, the
way to compute the messages is well understood. However, it is not clear which schedule enables the
convergence of the algorithm. We note, however, that for a specific kernel, if
such a schedule exists, it may be beneficial to define it in a recursive manner, thereby enabling the
utilization of the approach in this paper to construct decoding hardware for it.